PERSPECTIVES  ON  HOLISTIC  SCORING: 
THE  IMPACT  OF  MONITORING  ON  WRITING  EVALUATION 


By 
WILLA  BUCKLEY  WOLCOTT 


A  DISSERTATION  PRESENTED  TO  THE  GRADUATE  SCHOOL 

OF  THE  UNIVERSITY  OF  FLORIDA  IN  PARTIAL  FULFILLMENT 

OF  THE  REQUIREMENTS  FOR  THE  DEGREE  OF 

DOCTOR  OF  PHILOSOPHY 


UNIVERSITY  OF  FLORIDA 
1989 


Copyright  1989 

by 

Willa  Buckley  Wolcott 


To  Edward, 
Whose   understanding  and  help  made   this  project  possible, 


and  to 


Kedron,    Charnley,    Roma,    Sharon,    Bill,    and  Janet, 
Who  encouraged  me   throughout. 


ACKNOWLEDGMENTS 

I  would  like  to  thank  the  many  people  who  assisted  me 
with  this  study.  I  was  most  fortunate  to  have  had  the 
commitment,  expertise,  and  interest  of  17  outstanding 
participants  in  this  study.  Daniel  Kelly,  University  of 
Florida,  served  as  chief  reader  and  Anita  Doyle,  retired, 
Alachua  County  Schools,  served  as  associate  chief  reader.  The 
readers  and  table  leaders  were  as  follows:  Kent  Beyette, 
University  of  Florida;  Janet  Fisher,  Jacksonville  University; 
Noelle  Geiger,  Santa  Fe  Community  College;  Kay  Gonsoulin, 
Buchholz  High  School;  Anthe  Hoffman,  University  of  Florida; 
Gail  Kanipe,  Gainesville  High  School;  Donald  Kanipe,  Santa  Fe 
Community  College;  Robert  Lauriault,  University  of  Florida; 
Patrick  McMahon,  Tallahassee  Community  College;  Daniel 
McPhail,  Newberry  High  School;  Wendy  McPhail,  Gainesville  High 
School;  Mary  Morgan,  Gainesville  High  School;  Elizabeth 
Novinger,  Tallahassee  Community  College;  Vincent  Puma,  Flagler 
College;  and  Donald  Tighe,  Valencia  Community  College.  I 
appreciate  the  help  of  William  Wood  in  handling  the  logistics 
for  the  scoring,  and  I  also  appreciate  the  help  of  the 
holistic  scorers  at  the  Tallahassee  scoring  site  who  pilot 
tested  the  questionnaire  during  the  fall  of  1988. 

I am grateful to Denise Standiford, Eastside High School,
for independently validating the logs in this study, and to
the following people who assisted me with the statistical
programs: Robert Baskin, Center for Instructional and Research
Computing Activities; Anne DePalma, Department of Foundations;
and David Miller, Assistant Professor, Department of
Foundations. I am indebted to Carolyn Lyons for transcribing
the tapes and typing the final manuscript. I am also grateful
to Jeaninne Webb, Director of the Office of Instructional
Resources, and to my colleagues Sue Legg, Associate Director,
and Dianne Buhr, Assistant Director of Testing and Evaluation,
for their encouragement.

Finally,  I  would  like  to  thank  my  committee  members  for 
all  their  help:  Ruthellen  Crews,  my  chair,  who  guided  me  and 
believed  in  me  enough  to  let  me  undertake  this  study;  Margaret 
Early,  who  asked  insightful  questions;  and  Robert  Wright, 
Forrest  Parkay,  and  Sandra  Damico,  who  provided  support  and 
assistance  along  the  way. 


TABLE OF CONTENTS

ACKNOWLEDGMENTS
LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTERS

1 INTRODUCTION
    Nature of the Problem
    Purpose of the Study
    Rationale for the Study
    Significance of the Study
    Limitations of the Study
    Definition of Terms
    Organization of the Report

2 REVIEW OF SELECTED LITERATURE
    Conditions and Procedures of Holistic Scoring
    The Effectiveness of Holistic Scoring in Comparison to Other Evaluation Systems
    Factors Involved in the Evaluation of Writing
    Summary of Literature Review

3 METHODOLOGY
    The Monitored Holistic Scoring
    The Questionnaire
    The Unmonitored Holistic Scoring
    Methods Used for Analyzing Data

4 RESULTS AND DISCUSSION
    Mean Scores in the Two Scoring Conditions
    Interrater Reliability
    Impact of Chief Readers on Scoring
    Writing Criteria Across Different Scoring Levels
    Patterns Among Readers
    Nature of the Monitoring

5 SUMMARY AND CONCLUSIONS
    Procedures Used
    Issues Explored
    Discussion
    Recommendations for Research
    Conclusion

APPENDICES

A FORMS AND QUESTIONNAIRE
B SCORING CHARTS AND ADDITIONAL FIGURES

REFERENCES

BIOGRAPHICAL SKETCH


LIST OF TABLES

1. Descriptive Statistics for the Stratified Random Sampling
2. Analysis of Variance Mixed Models Source Table
3. Cronbach's Alpha of Readers' Scores
4. Cronbach's Alpha of Readers' and Chief Readers' Scores
5. Results of the Questionnaire, Part III
6. Results of Questionnaire on Biases and Preferences
7. Questionnaire Results Dealing with Training
8. Additional Questions for 12 Table Leaders and Chief Readers


LIST OF FIGURES

1. Monitored vs. unmonitored scores for overall group
2. Monitored vs. unmonitored scores of individual readers
3. Monitored vs. unmonitored scores: Score type summary
4. Results of rangefinders and samples in the monitored scoring
5. Category of comments assigned at each score level
6. Summary of points assigned to writing criteria by individual readers
7. Readers' ratings of importance of criteria in timed writings
8. Summary of points assigned to writing criteria by chief readers and table leaders
9. Ratings by chief readers and table leaders of importance of criteria in timed writing
B-1. Account of procedures
B-2. Summary of comments for paper 034
B-3. Summary of comments for paper 088
B-4. Results of table leaders' independent scoring of training samples


Abstract  of  Dissertation  Presented  to  the 

Graduate  School  of  the  University  of  Florida 

in  Partial  Fulfillment  of  the  Requirements  for 

the  Degree  of  Doctor  of  Philosophy 

PERSPECTIVES  ON  HOLISTIC  SCORING: 
THE  IMPACT  OF  MONITORING  ON  WRITING  EVALUATION 

By 

Willa  Buckley  Wolcott 

December  1989 

Chairman:   Ruthellen  Crews 

Major  Department:   Instruction  and  Curriculum 

The  purpose  of  this  study  was  to  examine  the  impact 
of  monitoring  on  the  reliability  and  validity  of  holistic 
scoring  as  a  means  to  evaluate  writing.  Six  questions  were 
addressed  concerning  (a)  mean  scores  in  monitored  versus 
unmonitored  settings,  (b)  agreement  among  readers,  (c)  the 
chief  readers'  influence,  (d)  criteria  for  scoring, 
(e)  patterns  in  readers'  responses,  and  (f)  the  nature  of 
the  monitoring. 

Qualitative and quantitative measures were used. In an
unmonitored scoring, eight experienced holistic scorers
with different teaching backgrounds rated over 50 expository
essays written by college students for a state
assessment program; the scorers recorded written responses
to each essay in logs. Another four special readers rated
a subset of the essays and orally responded to these papers
through the use of taped protocols. Later two chief
readers conducted a monitored scoring with customary
training procedures; three table leaders each monitored
four readers throughout. The readers scored another 50+
essays, matched with the first set on the basis of original
scores awarded during the actual scoring two years
previously. All participants completed a questionnaire
devised for this study.

A  mixed-model  ANOVA  for  nested  factors  and  repeated 
measures  revealed  a  significant  difference  (p  <  .00005)  in 
the  mean  scores  assigned  the  matched  essays  in  the  two 
conditions;  lower  scores  occurred  in  the  monitored  setting. 
The  ANOVA  also  showed  significant  differences  (p  <  .001) 
among  the  eight  readers:  These  individual  differences  were 
reflected  in  the  participants'  logs  and  audiotapes.  When 
two separate Cronbach's alphas were run with the chief
readers' scores included in the second alpha, the reliability
was consistently over .91. However, additional
data suggested not only that readers' scores more closely
approximated the chief readers' ratings in the monitored
setting  than  in  the  unmonitored,  but  also  that  the 
potential  for  fewer  noncontiguous  scores  existed  in  the 
monitored  setting.  Thus,  monitoring  appeared  effective  in 
increasing  readers'  reliability. 

More  significantly  in  terms  of  validity,  the  data 
showed  readers  clearly  responding  to  rhetorical,  as  well  as 
to  mechanical,  elements.  Furthermore,  all  participants 
indicated  that  they  assented  to  the  holistic  scoring 
standards  and  that  they  perceived  monitoring  as  a  helpful 
resource. 
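
As a rough illustration of the reliability statistic named in the abstract, the following sketch computes Cronbach's alpha for a small matrix of holistic scores. It is not the program used in the study: treating each reader as an "item" and each essay as a "case" is an assumption made here for illustration, the function name is hypothetical, and the scores are invented.

```python
from statistics import variance

def cronbach_alpha(ratings):
    """Cronbach's alpha for a score matrix.

    ratings: list of rows, one row per essay; each row holds one
    score per reader (readers are treated as the "items").
    """
    k = len(ratings[0])  # number of readers
    # Sample variance of each reader's scores across essays
    reader_vars = [variance([row[j] for row in ratings]) for j in range(k)]
    # Sample variance of the summed score each essay received
    total_var = variance([sum(row) for row in ratings])
    return (k / (k - 1)) * (1 - sum(reader_vars) / total_var)

# Invented scores on a four-point scale: 5 essays, 3 readers
sample = [
    [1, 2, 1],
    [2, 2, 3],
    [3, 3, 3],
    [4, 3, 4],
    [2, 1, 2],
]
print(round(cronbach_alpha(sample), 3))
```

Alpha approaches 1 as readers rank the essays more consistently with one another, which is the sense in which the abstract reports reliability.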


CHAPTER  1 
INTRODUCTION 


With  the  growth  of  writing  assessment,  holistic 
scoring  has  become  widespread  as  one  method  for  the  large- 
scale  evaluation  of  students'  essays.  Sometimes  called 
general  impression  scoring  after  the  work  of  Diederich, 
French,  and  Carlton  (1961),  holistic  scoring  is  based  on 
the  premise  that  the  whole  is  more  than  the  sum  of  its 
parts (Myers, 1980). Holistic scoring requires readers to
read an essay quickly but completely, to rank it mentally
against other papers in overall quality, and then to assign
a score accordingly. In
the  course  of  the  reading,  holistic  readers  make  no 
comments  on  the  paper,  nor  do  they  tally  any  errors. 
Rather,  they  rate  each  paper  in  terms  of  its  overall 
quality  and  record  the  score  in  coded  form  so  that  other 
readers  will  not  know  their  ratings.  Sample  essays,  which 
are  selected  from  the  same  test  administration  as  being 
representative  of  each  point  on  a  scoring  scale,  provide 
the  readers  with  a  frame  of  reference.  In  some  modified 
forms  of  this  procedure,  a  descriptive  set  of  scoring 
criteria  gives  the  reader  an  additional  guide. 


Nature  of  the  Problem 

The role that training and monitoring play in helping
holistic scorers make these overall writing evaluations is
yet to be fully explored. The importance of such training
is implied in Spandel and Stiggins' (1981) observation that
training can help to eliminate biases on the part of
readers; it is further emphasized by White (1985) who notes
as follows:

The training of readers, or 'calibration' as it
is sometimes called, is not indoctrination into
standards determined by those who know best (as
it is too often imagined to be) but rather the
formation of an assenting community that feels a
sense of ownership of the standards and the
process. (p. 164)*

Despite the seeming importance that these comments
attach to training, the actual influence that such
monitoring may have on readers' judgments within a
structured holistic scoring has received little attention.

Purpose  of  the  Study 

The purpose of this study was to explore how the
training and monitoring which readers undergo during a
formal holistic scoring influence the writing judgments
they make. By means of logs, holistic scores, protocol
analyses, and the metacognitive awareness of the participants
themselves as shown through a questionnaire, it was
expected that the study would reveal to what extent the
nature of this training embodies the "interpretive
community" described by White as reflective of Fish's
(1980) reader response theory.

*From Teaching and Assessing Writing by Edward White, 1985,
San Francisco: Jossey-Bass. Copyright 1985 by Jossey-Bass.
Reprinted by permission.

The  following  questions  were  addressed: 

1.  Do  the  mean  scores  for  the  essays  differ  when  the 
papers  are  evaluated  by  readers  working  in  a 
monitored  setting  from  when  the  papers  are  judged 
by  the  readers  working  independently? 

2. Do experienced readers participating in a monitored
scoring  achieve  greater  agreement  with  each  other 
than  when  they  evaluate  essays  independently? 

3.  What  impact  do  the  chief  readers  have  on  an 
holistic  scoring?  How  do  they  ensure  both  a 
reliable  and  a  collegial  reading? 

4.  What  criteria  do  readers  use  in  assigning  different 
score  levels  as  reported  through  their  logs, 
talking  protocols,  and  responses  to  a  questionnaire 
devised  for  this  study?  What  standards  are 
reflected  in  the  score  levels  assigned  across 
essays?  How  do  readers  respond  to  these  standards? 

5. Do any common patterns appear in the scorers'
written or audiotaped responses to the essays, or
do  their  comments  underscore  the  individuality  of 
each  reader's  transaction  with  the  text?  Do 
readers'  holistic  judgments,  as  shown  by  their 
written or verbal responses, correspond to the
writing  features  they  rate  as  important  on  a 
questionnaire? 

6. What is the nature of the monitoring that the
readers  receive  during  a  scoring  as  reported 
through  the  logs  of  table  leaders  and  readers?  Do 
the  procedures  noted  in  these  logs,  together  with 
the  protocols  of  the  special  readers,  support  the 
readers'  perceptions  of  their  own  holistic  scoring 
processes  as  noted  on  their  responses  to  a 
questionnaire? 


Rationale  for  the  Study 

Training  and  monitoring  comprise  an  essential  part 
of  a  formal  holistic  scoring.  In  fact,  the  key  to  a 
successful  holistic  scoring  lies  precisely  in  its  system  of 
"checks  and  balances"  that  ensures  the  greater  likelihood 
of  readers  rating  the  same  paper  comparably.  According  to 
procedures  established  by  the  Educational  Testing  Service, 
one  check  lies  in  the  structured  format  of  the  reading 
itself.  Readers  assemble  as  a  group  to  do  the  reading, 
hence  reemphasizing  the  need  for  working  toward  a  group 
consensus.  They  read  at  tables  directed  by  table  leaders, 
who  gently  function  as  consultants,  guides,  and  monitors. 
The  table  leaders,  in  turn,  are  guided  and  monitored  by  a 
head  table  consisting  of  a  chief  reader  and  associate  chief 
readers.

A  second  check  lies  with  the  ongoing  nature  of  the 
monitoring  provided.  Not  only  do  new  readers  undergo  a 
preliminary  training  session  in  which  they  experiment  with 
the  holistic  approach  on  old  essays,  but  in  subsequent 
scoring  sessions,  new  and  experienced  readers  alike  receive 
additional practice. Each scoring is introduced with
general comments, in which information is provided about
the examinees' testing conditions, and the readers are
reminded to avoid potential problems with length,
handwriting, and other surface features. Through this
procedure the chief reader establishes expectations for the
readers,  expectations  which,  as  Freedman  and  Calfee  (1983) 
hypothesize,  alter  the  "text  image"  (p.  94)  that  readers 
create  in  their  own  minds  before  making  a  judgment.  Then 
readers  begin  by  working  with  an  old  and  new  set  of 
rangefinders — that  is,  with  the  anchor  papers  selected 
beforehand  as  illustrative  of  various  scoring  levels  for 
that  test  administration.  Together  with  the  operational 
definitions  that  are  usually  included  in  a  modified 
holistic  scoring,  the  guidelines  and  rangefinders  provide 
the  criteria — both  explicit  and  implicit — against  which 
readers can rate the essays.

The  monitoring  continues  throughout  a  scoring,  as 
sample  papers  are  used  after  each  rest  break  to  ensure 
the  adherence  of  all  readers  to  group  standards;  because 
the  tally  of  sample  scores  is  publicly  recorded,  the 
readers  are  able  to  see  where  their  own  scores  place. 
Finally,  frequent  "check  readings"  are  conducted  in  which 
a  random  sample  of  current  essays  from  each  table  is 
independently  evaluated  by  a  reader,  a  table  leader,  and  a 
chief  reader.  If  any  of  the  papers  receive  discrepant 
scores — that  is,  a  noncontiguous  score  such  as  1-3,  2-4,  or 
1-4  on  a  four-point  scale — then  the  paper  is  returned  to 
the  party  whose  score  is  discrepant,  and  the  reader  is 
asked  to  review  the  essay.  If  the  reader  is  unwilling 
to adjust the score after reviewing the paper, then it is
subsequently refereed. Thus, in a formal scoring, training
adds  another  dimension  to  the  complexity  of  writing 
evaluation.  These  issues  of  training  and  monitoring  are 
critical,  for  they  affect  both  how  and  why  scorers  react 
the  way  they  do  to  essays. 
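
The check-reading rule described above can be illustrated with a brief sketch. It assumes a four-point scale and treats two scores as discrepant when they differ by more than one point, as in the 1-3, 2-4, and 1-4 examples. The function name, the rater labels, and the sample scores are hypothetical and are not drawn from the study's data.

```python
from itertools import combinations

def discrepant_pairs(scores):
    """Return pairs of raters whose scores are noncontiguous.

    scores: dict mapping a rater label to that rater's score for one
    paper. Two scores count as discrepant when they differ by more
    than one point (e.g., 1-3, 2-4, or 1-4 on a four-point scale).
    """
    pairs = []
    for (rater_a, score_a), (rater_b, score_b) in combinations(scores.items(), 2):
        if abs(score_a - score_b) > 1:
            pairs.append((rater_a, rater_b))
    return pairs

# Hypothetical check reading of one paper by three parties
check_reading = {"reader": 1, "table_leader": 3, "chief_reader": 2}
print(discrepant_pairs(check_reading))
```

In this invented case only the reader and the table leader hold noncontiguous scores, so under the procedure described above the paper would be returned for review and refereed if the disagreement persisted.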

As  a  tool  of  writing  evaluation,  holistic  scoring 
attracts  both  strong  support  and  serious  criticism;  the 
training  of  readers  for  the  purpose  of  achieving  a  scoring 
consensus  lies  at  the  heart  of  the  conflict.  For  example, 
in  support  of  holistic  scoring,  Davis,  Scriven,  and  Thomas 
(1987)  point  out  that  the  reliability  and  the  relatively 
low  cost  of  holistic  assessment  make  it  valuable  for 
evaluating a school's writing program. Bamberg (1982),
too, argues that writing programs which focus on the
writing  process  should  use  essays  for  evaluation  purposes, 
adding,  "Holistically  scored  essays  should,  therefore, 
play  a  leading  role  in  assessments  of  writing  programs 
and  writing  competence"  (p.  406).  Cooper  (Cooper  &  Odell, 
1977)  emphasizes  as  well  the  high  interrater  reliability 
that  can  be  achieved  in  holistic  scoring;  he  stresses  that 
with  similar  backgrounds  and  training,  raters  can  obtain 
substantial agreement in scoring several essays of a
student.

In addition, the theoretical assumptions behind
holistic scoring receive strong endorsement as White (1985)
underscores the value of examining papers as a whole. He
observes that holisticism "is the most obvious example in
the field of English of the attempt to evoke and evaluate
wholes rather than parts, individual thought rather than
mere socialized correctness" (p. 19). White readily
acknowledges the limitations of holistic scoring, pointing
out that this evaluation approach is unable to provide
diagnostic information for individual students and that,
moreover, the scores represent rankings rather than
absolute values. But while stressing the need for using
this approach responsibly, White defends the underlying
principles behind this form of evaluation:

Holistic  scoring  is  important  for  reasons  beyond 
measurement,  for  reasons  that  return  us  to  the 
nature  of  writing  and  to  the  importance  of  the 
study  of  writing  itself.  It  is  in  our  writing 
that  we  see  ourselves  thinking,  and  we  ask  our 
students  to  write  so  that  they  can  think  more 
clearly,  learn  more  quickly,  and  develop  more 
fully.  Writing,  like  reading,  is  an  exercise  for 
the  whole  mind,  including  its  most  creative, 
individual,  and  imaginative  faculties.  The  rapid 
growth  of  holistic  scoring  in  grading  reflects 
this  view  of  reading  and  writing  as  activities  not 
describable  through  an  inventory  of  their  parts, 
and  such  scoring  serves  as  a  direct  expression  of 
that  view:  By  maintaining  that  writing  must  be 
seen  as  a  whole  and  that  the  evaluating  of  writing 
cannot  be  split  into  a  sequence  of  objective 
activities,  holistic  scoring  reinforces  the  vision 
of  reading  and  writing  as  intensely  individual 
activities  involving  the  full  self.   (p.  32) 

In order for students to understand more clearly the
evaluation criteria used on their papers, White, in fact,
advocates the application of holistic scoring guides in the
classroom. That many teachers enthusiastically endorse
this practice (Mishler & Hogan, 1982; Paulis, 1985;
Westcott & Gardner, 1984) gives added weight to the value
of holistic scoring as an aid to teaching and revising.

At  the  same  time,  holistic  scoring  receives  pronounced 
criticism  from  some  researchers  and  scholars.  Frequently 
cited  as  emblematic  of  reader  reliability  problems  is  the 
classic  work  of  Diederich,  French,  and  Carlton  (1961).  In 
this  study  the  researchers  asked  over  50  readers  from  6 
fields  to  grade  300  essays  written  by  new  freshmen  at 
different colleges. All of the essays received at least
five of the nine possible scores on the scale, and one-third
even received the entire range of scores. Often
overlooked  in  references  to  this  study,  however,  is  the 
absence  of  any  criteria  or  scoring  assistance  provided  for 
readers  (White,  1985);  such  a  lack  of  training  represents 
a  complete  departure  from  the  guidance  given  in  most 
holistic  scorings  of  recent  years.  Nevertheless,  even  with 
such  guidance  given,  the  issue  of  score  reliability — a 
broad  term  that  reflects  potential  error  sources  in  topics, 
tasks,  conditions  of  examinees,  and  agreement  among 
readers — remains,  according  to  Breland,  Camp,  Jones, 
Morris,  and  Rock  (1987),  the  "Achilles  Heel"  of  writing 
assessment. 

In  addition  to  questions  of  reliability,  the  issue 
of  the  validity  of  holistic  scoring  has  come  under  attack. 
For  example,  Charney  (1984)  argues  that  "a  given  set 
of criteria devised by one set of experts is no more valid
than a different set of standards, arrived at by a
different  group  of  experts"  (p.  73).  She  suggests, 
moreover,  that  the  very  need  for  extensive  training  in 
holistic  scoring  implicitly  illustrates  the  difficulties 
readers  experience  in  adhering  to  the  imposed  criteria. 
This  difficulty,  according  to  Charney,  is  further  shown  in 
those  studies  which  have  found  such  superficial  features  of 
writing  as  handwriting  or  spelling  to  be  influential  in 
the  holistic  scores  assigned.  Thus,  Charney  concludes, 
"Holistic  ratings  should  not  be  ruled  out  as  a  method  of 
evaluating  writing  ability,  but  those  who  use  such  ratings 
must  seriously  consider  the  question  of  the  validity  of  the 
scores  that  result"  (p.  79). 

Similar  caution  is  reflected  in  the  report  by  the 
CCC  [College  Composition  and  Communication]  Committee  on 
Teaching  and  Its  Evaluation  in  Composition  (1982);  the 
report  notes  that  holistic  scoring  is  of  "limited  value" 
for evaluating either writing programs or courses. Several
educators  and  composition  specialists  also  express  their 
concern.  Hirsch  (1977)  states,  for  example,  that  some  of 
the greatest thinkers in history have been unable to establish
holistic standards which encompass both intrinsic and
extrinsic criteria. Hirsch insists that the Aristotelian
mode of intrinsic evaluation, which judges how effectively
and how correctly writers carry out their intentions, is
better for predicting writing ability than is the Platonic
mode, which judges the quality of intentions external to
the  writers;  the  intrinsic  mode  is  also,  he  argues,  "the 
only  kind  of  assessment  in  which  anyone  should  have 
confidence"  (p.  186). 

Elbow  (1986)  expresses  reservations  about  an  evaluation 
model  which  requires  agreement  among  judges.  Not  only  may 
it  result  in  an  overemphasis  on  such  measurable  features  as 
grammar  and  spelling,  but  it  also  requires  readers  to 
suspend  their  own  judgments  in  favor  of  other  standards. 
For Elbow, "descriptive perceptions" (p. 255), even when
they conflict, provide a more valuable learning experience
than  those  evaluations  which  merely  rank  or  measure. 

The  need  for  agreement  among  holistic  raters  disturbs 
the  educator  Roberts  (1983)  as  well,  contributing,  in  his 
view,  to  a  limited,  "product-centered  and  decontextualized" 
form  of  evaluation  that  disregards  the  writer's  purpose, 
intentions,  and  environment.  Roberts  questions  whether 
holistic  scoring,  like  the  empirical  research  to  which  he 
attributes  its  growth,  can  effectively  measure  writing 
quality,  writing  change,  or  "anything  other  than  how  well 
a  writing  sample  simulates  an  Idealized  Text"  (p.  3);  his 
latter  observation  directly  contradicts  the  view  expressed 
by  Spandel  and  Stiggins  (1981)  that  holistic  scoring, 
comparing  as  it  does  the  relative  quality  of  essays,  has  no 
"preconceived  notion  of  the  'ideal'  paper"  (p.  24). 


As  can  be  seen,  those  who  question  the  validity  of 
holistic  scoring  imply  that  emphasizing  agreement  of  scores 
through  training  destroys  the  individual  perspective — an 
individuality  endorsed  in  recent  years  by  such  reader 
response  theorists  as  Bleich  (1975).  But  White  (1985) 
argues  an  opposing  viewpoint  just  as  emphatically.  Calling 
attention  to  the  importance  of  the  nature  of  the  community 
that  forms  in  an  holistic  scoring,  White  compares  the  "true 
community of assent," which is properly developed through
a  formal  essay  scoring,  to  the  "interpretive  community" 
discussed  by  another  theorist  of  reader  response,  Fish 
(1980).  Although  Fish's  concept  refers  to  the  sense  of 
agreement  readers  of  literary  texts  strive  to  attain,  White 
(1985)  sees  similarities  between  Fish's  reader  response 
theory  and  the  need  for  establishing  a  responsive  community 
in  holistic  scoring.  Thus,  a  study  is  needed  to  explore  the 
impact  of  training  both  on  the  nature  of  the  community  that 
develops  among  the  scorers  and  on  any  scoring  agreement 
that results.

Significance  of  the  Study 

In revealing the extent to which holistic scorers
willingly adopt the criteria, the study should have educational,
theoretical, and practical significance. First,
as Keech (1982) notes, it should show the individual
holistic scorer responding "not as an error-counter or a
conserver of threatened forms, but as a receiver of
intended communication" (p. 174). Holistic scorers attempt
to derive the meaning through envisioning each text as
a whole; they operate in a context in which they are
encouraged both to remember the limited conditions under
which examinees write and to recognize that strong papers
may not be perfect ones. As such, the study should verify
White's (1985) observation expressed below:

The simple fact is that the definition of textuality
and the reader's role in developing the
meaning of a text that we find in recent theories
of reading happens to describe much of our experience
of responding with professional care to the
writing our students produce for us. Part of the
problem  of  evaluating  student  writing  comes  out  of 
our  deep  understanding  that  we  need  to  consider 
the  process  of  writing  as  well  as  the  product 
before  us  and  that  much  of  what  the  student  is 
trying  to  say  did  not  get  very  clearly  into  the 
words  on  the  page.   (p.  93) 

The study should further serve to integrate writing
assessment more closely with reader response theory by
illuminating whether the particular sense of community
that arises through training procedures influences the
holistic judgments made. The early composition researchers
Braddock, Lloyd-Jones, and Schoer (1963) stress the
importance of agreeing to criteria. In a reference to
analytic scoring, for example, the researchers link the
effectiveness with which criteria are applied directly to
"the commitment which each rater feels toward the criteria
being employed" (p. 15).


As  indicated  by  White  (1985),  reader  response  theorist 
Fish  (1980)  attributes  the  agreement  which  can  occur  among 
readers  of  literature  to  a  "stability  in  the  makeup  of 
interpretive  communities,"  a  stability  arising  from  a 
commonality  of  goals  (Fish,  1980,  p.  15).  According  to 
Fish,  the  stability  is  due  not  to  independent  qualities 
within  the  texts,  but  rather  to  the  "interpretive 
strategies" which give shape to the event of reading and
hence  to  the  making  of  meaning  of  the  texts  themselves. 
The  nature  of  the  interpretive  communities  can  change 
because  the  interpretive  strategies  are  learned;  they  are 
learned  through  persuasion,  as  writers  invite  readers  to 
employ particular strategies.

The  reader  response  theorist  Rosenblatt  (1985,  1988) 
finds  Fish's  view  too  narrow;  however,  she  also  underscores 
the  value  of  agreement  in  her  observation  that  "in  any 
specific  situation,  given  agreed-upon  criteria,  it  is 
possible  to  decide  that  some  readings  are  more  defensible 
than  others"  (Rosenblatt,  1985,  p.  36).  An  important 
distinction  Rosenblatt  makes  is  between  "efferent"  and 
"aesthetic"  reading.  In  efferent  reading,  the  reader  is 
concerned  with  what  can  be  taken  away  from  the  reading, 
whereas  in  aesthetic  reading,  the  reader  is  involved 
with  experiencing  the  reading  event  itself.  The  two  types 
of  reading  fall  on  a  continuum,  requiring  readers  to 
select the primary elements to which they will give their
attention. Of special significance for this study is
Rosenblatt's  contention  that  "the  need  for  grasping  the 
author's  purpose  and  for  a  consensus  among  readers  is 
usually  more  stringent  in  efferent  reading"  (1988,  p.  8). 
As  she  elaborates,  "In  efferent  reading,  the  student  has  to 
learn  to  focus  attention  mainly  on  the  public,  referential 
aspects  of  consciousness  and  to  ignore  private  aspects  that 
might  distort  or  bias  the  desired  publicly  verifiable  or 
justifiable  interpretation"  (1988,  p.  8).  In  this  regard 
the  training  of  holistic  scorers  can  perhaps  be  perceived 
as  a  means  for  helping  readers  adopt  an  appropriate  stance 
in  which  they  overcome  their  biases  and  select  agreed-upon 
public  criteria. 

Thus,  the  issue  of  agreement  surfaces  both  in  reader 
response  theory  and  in  writing  assessment.  Because  this 
issue  lies  at  the  heart  of  questions  concerning  the 
validity  of  holistic  scoring,  this  study  should  have 
theoretical  implications  in  revealing  the  degree  of 
commitment  holistic  scorers  feel  to  the  standards  they  use. 

Finally,  the  study  should  have  practical  significance. 
If  the  monitoring  and  training  processes  of  the  structured 
scoring have a noticeable impact on readers' evaluations,
then  the  need  for  continuing  holistic  scorings  within  a 
formal  context  will  be  apparent.  If,  on  the  other  hand, 
readers'  judgments  do  not  appear  to  be  unduly  affected  or 
altered by the group monitoring procedures, then an option
might be to have experienced readers follow what is
currently  done  in  many  state  assessments  and  score  some 
essays  at  home. 

Limitations  of  the  Study 

Any conclusions to be drawn from the study will,
of necessity, be limited, as the small number of
participants (17 altogether, including readers, table
leaders, and chief readers) and the limited number of
essays involved (a little over 100) will prevent
generalizations. Furthermore, the scorers involved in the study
will  be  highly  experienced  readers;  different  results  might 
be  obtained  with  less  experienced  scorers  who  might  not 
react  the  same  way,  especially  in  the  at-home  scoring. 
Finally, the study relies on the accuracy of scorers'
self-reporting in logs, taped protocols, and questionnaire
responses; such self-reporting not only entails
subjectivity but also, as Freedman and Calfee (1983) indicate,
depends upon the evaluators' abilities to articulate their
own responses.

Definition  of  Terms 

For the purpose of this study, the following
definitions are used:

1. Scorer and reader are used interchangeably to
refer to those readers making the rating judgments
on each paper. Special scorer is used to refer to
any of those four readers who do talking protocols
as they evaluate the papers.

2. Chief reader or trainer is used to refer to the one
or two individuals who conduct the holistic
scorings and who train the readers by providing
sample papers.

3. Table leader refers to the individual who is in
charge  of  a  table  of  readers  and  who  monitors  those 
readers'  progress. 

4.  Monitored  scoring,  structured  scoring,  and  formal 
scoring  are  used  interchangeably  to  refer  to  a 
formal  writing  assessment  approach  in  which  a  group 
of  readers  meets  and  follows  the  set  procedures 
described  in  this  study. 

5.  Training  and  calibration  of  readers  refer  to  the 
processes  whereby  the  readers  are  given  initial 
exposure  to  selected  sample  papers  and  ongoing 
practice in reading and scoring those essays.
These processes also include the public tallying of
scores on those papers.

6.  Monitoring  refers  to  the  ongoing  process  whereby 
table  leaders  continuously  check  some  of  the  actual 
essays  the  readers  have  scored. 

7. Check reading refers to the formal process whereby
the chief readers collect from the table leaders
two papers per reader which the table leaders have
also scored. The chief readers independently score
these papers and compare the results.

8. Rangefinders and anchor papers are used
interchangeably to refer to the six essays which have
been formally chosen in a previous sample selection
process as representative of each scoring level
from level 1 (the lowest) to level 4 (the highest).
These  papers,  which  the  readers  must  initially  rank 
order  according  to  quality,  serve  as  guideposts  for 
the  standards  of  any  reading. 

9. Sample refers to additional essays which have also
been  selected  from  the  same  previous  scoring  as  the 
rangefinders  and  which  are  used  throughout  a 
scoring  to  illustrate  particular  levels  of  scores. 

10. Operational definitions is used to refer to those
written  descriptors  of  each  level  of  paper  and  the 
qualities  that  the  levels  embody. 

11.  Log  is  used  to  refer  to  the  running  commentary 
readers  provide  of  their  scores  and  the  decisions 
for  these  scores.  It  is  distinguished  from  the 
term  Account  of  Procedures  which  is  used  to  refer 
to  the  customary  log  of  procedures  that  chief 
readers  maintain. 


Organization  of  the  Report 

A review of selected literature is presented in
Chapter 2. The methodology used in the study is addressed
in Chapter 3. The qualitative and quantitative results are
discussed in Chapter 4, and a summary of the findings and
their implications is presented in Chapter 5.


CHAPTER  2 
REVIEW  OF  SELECTED  LITERATURE 


Literature  on  holistic  scoring  and  its  related  areas 
falls  into  three  broad  categories:  One  set  of  studies 
establishes  and  describes  holistic  scoring  procedures  and 
conditions,  primarily  for  the  purpose  of  improving  their 
implementation.  A  second  category  addresses  such  issues  as 
the  validity,  reliability,  and  cost-effectiveness  of 
holistic  scoring  and  explores  the  effectiveness  of  this 
evaluation system as it relates to other procedures. A
third  category,  more  theoretical  in  nature,  explores  the 
complexities  entailed  in  making  writing  judgments. 

Conditions  and  Procedures  of  Holistic  Scoring 

The  classic  study  by  Diederich,  French,  and  Carlton 
(1961)  cited  in  the  introduction  has  influenced  both  the 
conceptual  base  and  the  method  of  holistic  scoring.  As 
noted  previously,  the  researchers  asked  readers  to  make  an 
overall  judgment  as  to  the  quality  of  a  particular  essay  by 
implicitly  rank-ordering  the  papers.  In  this  study,  53 
professionals  from  a  variety  of  fields  evaluated  about  300 
essays  written  at  home  by  college  freshmen  from  different 
universities.  The  raters  were  first  instructed  to  sort  50 
papers into three piles signifying their level of quality:
average, above-average, and below-average. Next, they were
to sort each pile into three more stacks, for a total of
nine.  Finally,  they  had  to  place  the  remaining  papers  in 
one  of  the  appropriate  piles  and  write  comments  about  what 
they  liked  or  disliked  in  the  essays. 

When  the  grades  assigned  to  each  paper  were  correlated, 
the  median  of  correlations  between  readers  was  .31, 
indicating  a  low  reliability  for  the  reading.  Precisely 
for this reason, the work of Diederich et al. has often
been  cited  as  indicative  of  the  problems  inherent  in 
general  impression  scoring.  Yet,  as  White  (1985)  notes,  it 
is  important  to  recognize  that  the  conditions  of  their 
study  differed  considerably  from  those  typically  used 
today:  Not  only  did  the  readers  come  from  diverse 
backgrounds,  but  neither  training  nor  monitoring  was 
provided;  moreover,  the  papers  were  written  outside  class, 
a departure from normal testing conditions.
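
The reliability figure reported above, a median of the correlations between readers, can be illustrated with a brief computational sketch. The sketch below is not drawn from Diederich et al. (1961); it assumes that a Pearson correlation is computed for every pair of readers and that the median of those values is then taken. The function names and the grades shown are hypothetical.

```python
from itertools import combinations
from statistics import mean, median


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def median_reader_correlation(scores):
    """Median of the correlations computed over every pair of readers.

    `scores` maps a reader's name to that reader's grades, listed in the
    same essay order for all readers (an assumed layout).
    """
    return median(pearson(a, b) for a, b in combinations(scores.values(), 2))


# Hypothetical grades (piles 1 through 9) given by three readers to five essays.
example_scores = {
    "reader_a": [1, 3, 5, 7, 9],
    "reader_b": [2, 3, 6, 6, 9],
    "reader_c": [9, 2, 7, 4, 5],
}
print(round(median_reader_correlation(example_scores), 2))
```

With many readers, a low median such as .31 would indicate that a typical pair of readers ranked the same papers quite differently.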

In  fact,  because  of  the  large  variation  that  occurred 
in  their  study,  the  authors  conclude  that  reliability  is 
crucial  if  scoring  the  essays  is  to  serve  any  important 
purpose. They suggest that readers should be tested so that
only those whose ratings correlate at .60 with the general
consensus are allowed to score; they also speculate that some
training and directions might be of help. Thus, the
discoveries made in this major work have undoubtedly been
instrumental in the development of holistic scoring as it
is known today.
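
The screening rule that Diederich et al. propose can be sketched in the same manner. The sketch below rests on assumptions not stated in the source: the "general consensus" is taken to be the mean grade each essay received from all readers (including the reader being tested), and agreement is measured by a Pearson correlation. The function names and the grades are hypothetical, and a reader who assigns every essay the same grade would have to be handled separately, since the correlation is then undefined.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def qualified_readers(scores, threshold=0.60):
    """Return the readers whose grades correlate with the consensus
    at or above the threshold.

    The consensus for each essay is assumed to be the mean of all
    readers' grades on that essay.
    """
    consensus = [mean(essay_grades) for essay_grades in zip(*scores.values())]
    return [
        reader
        for reader, grades in scores.items()
        if pearson(grades, consensus) >= threshold
    ]


# Hypothetical grades given by three readers to the same five essays.
trial_scores = {
    "reader_a": [1, 2, 3, 4, 5],
    "reader_b": [2, 2, 4, 4, 5],
    "reader_c": [5, 1, 4, 2, 3],
}
print(qualified_readers(trial_scores))
```

In practice, a scoring program might instead compare each reader with a consensus that excludes that reader's own grades; the choice made here is only for brevity.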

A  second  influential  study  is  The  Measurement  of 
Writing  Ability  (1966)  by  Godshalk,  Swineford,  and  Coffman. 
The  authors  used  holistic  scoring  to  evaluate  each  of  five 
essays  written  by  nearly  650  grade  11  and  grade  12  students 
throughout  the  country.  Although  the  study  was  undertaken 
for  the  purpose  of  validating  multiple-choice  items  on  a 
standardized  test,  the  researchers  conclude  with  several 
recommendations  about  holistic  scoring. 

For  example,  they  note  the  importance  of  providing 
sufficient  time  for  training  early  in  a  scoring  session 
in  order  to  avoid  having  readers  assign  overly  high  scores 
in  the  beginning.  Finding  the  time  of  day  to  be  a  factor 
in  a  scoring,  they  also  emphasize  the  need  for  having 
multiple  readings  of  a  paper  done  at  different  stages  of  a 
scoring  session.  Such  a  practice  can,  according  to  the 
researchers, minimize both the variance among readers and
the  variance  deriving  from  the  time  of  day. 

In  later  stages  of  their  study,  the  researchers  reduced 
the  number  of  readers  evaluating  any  one  essay,  and  they 
experimented  with  changing  their  original  3-point  scoring 
scale,  denoting  superior,  average,  and  inferior,  to  a 
4-point scale. The even-numbered scale required readers
to choose the half of the scale that each paper exemplified,
rather than resorting to the safety of the middle
score whenever in doubt. Because of this experimentation
with various conditions, the work of Godshalk et al.
(1966) has formed the basis for many formal holistic
scorings conducted today.

More  recent  literature  reflects  continued  interest 
in the conditions under which an holistic scoring is
conducted. For example, Paden (1986) explored the possible
relationship  between  the  context  in  which  an  essay  is 
placed  and  the  influence  of  floor  and  ceiling  effects 
on  the  range  of  different  score  levels.  Citing  the 
conclusions  of  other  researchers  who  failed  to  minimize 
context  effects  on  a  scoring,  Paden  hypothesized  a 
theoretical  model  to  link  the  effect  of  context  to 
the  potential  for  increase  or  decrease  contained  by  each 
score  level.  Because  the  potential  for  change  can  differ 
substantially  depending  on  whether  a  score  is  a  1  or  a  3, 
for  example,  the  particular  placement  of  a  score  on  a  scale 
can,  according  to  Paden,  affect  the  amount  of  change  that 
context can influence. She stresses the need for
validating her hypothetical model.

Concern  for  training  of  the  readers  also  appears 
in  several  studies  during  this  decade.  The  role  that 
training  of  readers  can  play  has  been  illustrated  in  a 
study by Freedman (1981), who studied the impact of three
variables (essay, reader, and environment) on an holistic
scoring. Four highly qualified scorers worked in pairs to
score holistically 64 argumentative papers composed on each
of  eight  topics  by  college  students;  in  a  later  session, 
they  rated  the  same  essays  analytically.  Two  trainers 
trained  the  different  pairs  of  raters,  providing  the 
readers  with  sample  essays  for  each  topic. 

Freedman  (1981)  found  that  the  four  readers  graded 
the  papers  consistently  with  each  other  and  appeared  to 
be  unaffected  by  the  rating  session  and  the  time.  In 
addition,  their  holistic  scores  correlated  significantly 
with  all  the  analytic  ratings  except  for  the  area  of 
usage.  The  choice  of  topic  did  seem  to  affect  the  results 
in that one opinion topic received higher scores
throughout. Most important, Freedman also found an apparent
effect  that  the  trainer  could  have  on  the  scoring.  Even 
though  both  trainers  (one  of  whom  was  Freedman)  agreed  on 
the  scores  to  be  given  sample  essays,  on  a  replay  of  the 
taped  training  sessions,  differences  in  the  discussions 
appeared.  Thus,  while  one  trainer  might  state  that  two 
contiguous  scores  of  2  and  1  were  appropriate  for  a  given 
sample,  the  other  trainer  might  state  that  the  1  score 
was not suitable for papers of that type.

Freedman (1981) speculates that such training
differences can result in higher or lower scores being assigned
accordingly,  and  she  suggests  that  researchers  in  small 
projects  avoid  conducting  the  training  themselves  in  order 
to  avoid  influencing  results  in  favor  of  their  hypotheses. 


The  importance  of  training  in  writing  evaluation  was 
underscored  by  Hrach's  dissertation  (1983),  in  which  she 
explored  possible  links  between  raters'  previous  writing 
experiences,  their  tolerance  of  ambiguity,  and  their 
evaluation  approaches.  Fifty-nine  secondary  English 
teachers  sorted  into  whatever  scoring  categories  they  chose 
20  papers  written  on  the  same  topic  by  secondary  school 
students.  Using  a  three-way  multidimensional  scaling 
system,  Hrach  identified  the  basis  of  the  classifications 
to  be  style,  organization,  maturity  of  thought  and 
expression,  and  substance.  Thirty-nine  other  teachers  who 
rated  the  compositions  analytically  confirmed  the  accuracy 
of the classifications. In addition, the teachers
completed two instruments addressing their experience with
writing skills and their tolerance of ambiguity. Results of
Hrach's study, like Diederich et al.'s (1961), confirmed
that raters were influenced by the presence or absence of
certain writing qualities in essays. Also like
Diederich and his colleagues, Hrach discovered that
the raters differed substantially on what they considered
important. That is, only three raters used three of the
four dimensions she identified, and almost half focused
on one dimension alone. These differences in evaluations
did not appear related either to the raters' previous
experiences with writing skills or to their tolerance for
ambiguity,  as  these  features  were  subsequently  not  found  to 
be  influential  in  writing  judgments. 


Because the raters were given neither criteria nor
restrictions for sorting the papers, Hrach (1983), who
endorses the realism of this practice, nevertheless
suggests that a lack of training in writing evaluation
might explain the wide variability in results. She
speculates  that  even  if  the  instructors  had  completed 
coursework  in  writing,  without  training  in  the  teaching 
or  evaluating  of  writing,  "they  probably  would  not  have 
developed  common  perceptions  of  what  constitutes  good 
writing"  (p.  171).  In  what  seems  a  forerunner  of  this 
study,  Hrach  suggests  that  it  might  be  useful  to  examine 
how  raters  trained  in  holistic  scoring  rate  papers 
independently  and  as  part  of  a  group. 

The  importance  of  training  in  an  holistic  scoring 
has  also  been  emphasized  by  Sweedler-Brown  (1985),  who 
sought  to  determine  whether  the  amount  of  training  and  the 
experience  that  holistic  scorers  had  with  a  grading  scale 
affected  either  their  evaluations  of  writing  quality  or 
the  consistency  of  the  evaluations. 

Using  a  six-point  scale,  20  experienced  writing 
instructors  and  graduate  students,  whose  experience  with 
holistic scoring ranged from none to three years,
holistically scored 897 essays written by university students.
From  this  group  of  essays  the  36  essays  which  had  received 
discrepant  scores  were  selected  for  analysis.  Each  of  the 
readers involved in the discrepant scores, together
with one of the six trainers who had served as referees,
was asked three days later to score the same essays
analytically. The eight criteria included content,
organization, diction, development, mechanics, and spelling.
When  the  holistic  scores  were  correlated  with  the  total 
analytic  scores,  the  trainers  were  found  to  give  equivalent 
holistic  and  analytic  scores  over  60%  of  the  time,  whereas 
readers  assigned  comparable  scores  only  37%  of  the  time. 
Although  both  trainers  and  readers  valued  content  and 
sentence  structure  (albeit  in  reverse  order),  the  trainers 
tended  to  give  lower  holistic  and  analytic  scores  than  did 
the readers.

Thus,  the  researcher  concludes,  "Graders  with  greater 
experience  and  training  have  significantly  greater 
consistency  between  their  holistic  and  analytic  evaluations 
of  the  same  essay,  from  which  we  conclude  that  the  amount 
of  training  and  experience  does  significantly  affect  the 
reliability  of  a  grader's  evaluation"  (p.  54).  The 
importance  Sweedler-Brown  (1985)  attaches  to  training  seems 
justifiable;  however,  some  limitations  in  her  study  suggest 
that  the  results  must  be  interpreted  cautiously. 

That Sweedler-Brown's (1985) conclusion derives from
a small sample of 36 discrepantly scored papers seems
troublesome: Not all readers would necessarily have been
involved in these discrepant scores, and hence, the actual
number of readers from whom correlations were obtained,
while not clearly stated, might have been fewer than the
20 doing the holistic scoring. In addition, the training
provided  for  the  analytic  scoring  was  far  more  limited  than 
that  given  to  the  holistic  criteria,  with  the  result  that 
agreement  on  the  analytic  scales  might  have  been  harder 
to  achieve.  Finally,  as  Sweedler-Brown  acknowledges,  the 
trainers scored far fewer papers than did the readers. The
trainers  might  have  remembered  their  original  holistic 
scores  on  the  discrepant  papers,  thereby  contributing  to 
the  higher  correlation  they  achieved  between  analytic  and 
holistic scores. Thus, the limitations of this study
militate against the conclusions, however strong the need
for training in holistic scoring appears to be.

Differences  between  trained  and  untrained  scorers  were 
also  examined  by  Huot  (1988)  in  a  recent  dissertation 
somewhat  related  to  the  present  study.  Arguing  that  too 
much  attention  has  been  paid  to  the  issue  of  agreement, 
Huot  explored  the  validity  of  holistic  scoring  by  comparing 
the  protocols  of  four  novice  and  four  expert  holistic 
raters.  Each  scorer  rated  84  essays  selected  from  a 
previous  assessment  and  written  in  letter  format  by 
college  freshmen  on  two  different  topics.  The  scorers, 
who  talked  aloud  for  half  the  essays  (42  each),  were  given 
training  in  doing  protocol  analysis.  Then,  over  a  four-day 
period,  pairs  of  expert  readers  and  pairs  of  novice  readers 
scored for two days apiece. The novices were given neither
training in holistic scoring nor any rubric to use; the
experts,  together  with  a  scoring  leader,  first  trained  with 
anchor  papers  and  the  original  rubric,  and  then  they 
modified  the  rubric  for  the  protocol  scoring. 

The  researcher  coded  the  number  of  responses  that  each 
scorer  made,  noting  when  the  responses  were  made,  whether 
the  responses  were  positive,  neutral,  or  negative,  whether 
the  responses  were  made  to  the  writer  or  to  the  essay,  and 
the  criteria  on  which  the  judgments  were  based.  After  each 
scoring  session,  the  researcher  interviewed  the  scorers. 

Huot found that even though the novice raters made
substantially more comments than did the expert raters, the
experts' comments, many of which were made after rather
than during the scoring, reflected more varied viewpoints
and more personal engagement with the student essays.
Because  the  experts  could  use  a  scoring  rubric  whereas  the 
novices  could  not,  Huot  concludes  that,  contrary  to  his 
expectations,  the  rubric  and  other  holistic  training 
procedures  did  not  intrude  on  the  rating  process.  Rather, 
by  providing  scorers  with  "expectations,  justification  or 
explanation"  (p.  223),  the  rubric  enabled  the  raters  to 
read  the  essays  more  fully.  Novice  raters,  on  the  other 
hand,  sought  strategies  that  would  work  with  a  specific  set 
of  papers  and  concentrated  on  evaluation  to  the  exclusion 
of  any  personal  engagement  with  the  essays.  Huot  suggests 
that holistic scoring procedures, far from impeding true
reading, "actually promote the kind of rating process that
insures a valid reading and rating of student writing" (p.
237).

As  can  be  seen  from  this  section  of  the  literature 
review,  several  studies  have  reflected  concern  both  for 
determining  what  occurs  in  holistic  scoring  and  for 
improving the procedures under which writing is
holistically scored. Because of these concerns, several of these
works  have  become  the  reference  point  for  the  practices 
currently  used  in  a  structured  holistic  scoring. 

The  Effectiveness  of  Holistic  Scoring  in 
Comparison  to  Other  Evaluation  Systems 

A  number  of  studies  have  explored  either  the 
effectiveness  or  the  cost  efficiency  of  holistic  scoring, 
especially  as  it  relates  to  other  forms  of  writing 
evaluation.  One  such  study  was  the  early  undertaking  of 
Follman  and  Anderson  (1967),  who  randomly  assigned  five 
raters  to  use  one  of  five  evaluation  approaches  in  rating 
ten  compositions  written  by  college  students.  The  rating 
systems  included  The  California  Essay  Scale,  the  Cleveland 
Composition  Rating  Scale,  the  Diederich  Rating  Scale,  the 
Follman  English  Mechanics  Guide,  and  the  Everyman's  Scale 
in  which  the  evaluators  could  use  whatever  system  they 
wished.  All  but  two  of  the  evaluators  were  English 
education  majors  enrolled  in  the  same  English  course. 


Follman  and  Anderson  found  high  correlations  among  the 
different  systems  except  for  the  Diederich  scale;  they  also 
found  high  reliability  for  each  group,  leading  the 
researchers  to  conclude  that  the  homogeneity  of  the  raters 
might  be  a  major  contributing  factor. 

In a later study Winters (1978) compared four different
scoring systems (one General Impression, two analytic, and
one T-unit analysis) to determine how well each system
classified four groups of students who had been previously
placed in high and low writing groups in high school and in
college.

After  six  high  school  and  college  teachers  were 
thoroughly  trained  in  at  least  two  of  the  scoring  systems, 
four  of  the  readers  used  each  system  to  score  80  papers. 
Interrater  reliability  was  substantial  on  all  four  systems, 
with  the  General  Impression  system  achieving  the  lowest 
rate  at  .81,  in  contrast  to  the  .99  reliability  rate  of 
the  T-unit  analysis  system.  Winters  (1978)  attributes  the 
relatively  low  reliability  of  the  General  Impression  scale 
both  to  the  fact  that  the  scorers  used  this  system  first 
and  hence  lacked  the  practice  they  subsequently  experienced 
with  the  other  systems  and  also  to  the  fact  that  the  rubric 
for  this  system  was  less  defined  than  it  was  for  any  of  the 
other procedures.

Of most concern to Winters (1978) is her finding that
in three of the four systems (the General Impression
system, the Diederich Expository Scale, and the CSE
Analytic Scale, which was developed at the Center for
the Study of Evaluation), the low college group did better
than did their high peers. Winters attributes this
unexpected occurrence to the small size of the sample,
the atypical nature of summer students, and, most
significantly, to the substantial number of foreign-born students
in the college low group, students whose problems with
syntax or with awkward wording might not be reflected by
the scoring systems. That the T-unit did not discriminate
among the four groups at all could, according to Winters,
be explained by the similarity of age in the students
of the study, unlike those students in previous research on
T-units.

The  researcher  speculates  that  three  systems  are  better 
than  two  for  classifying  students'  writing,  and  she  notes 
that  a  combination  of  General  Impression  scoring,  together 
with  an  analytic  system,  seems  best.  She  concludes  that 
General Impression scoring, while not adequate alone for
placement  procedures,  should  be  included  in  most  writing 
assessments. 

Like Winters, Shoaf (1985) studied the effectiveness
of two different methods, holistic scoring and T-unit
analysis, in evaluating the writing skill of high school
students. An additional purpose of her study was to
determine whether students gained in writing proficiency
over a semester and retained that growth during the years
following.

Shoaf  (1985)  had  the  students  enrolled  in  her 
sophomore-level  average  composition  class  write  a  50-minute 
pre-  and  post-test  on  the  same  topic;  one  and  two  years 
later  all  students  taking  English  wrote  on  the  same 
topic for a delayed post-test. The researcher and an
assistant  then  tallied  the  number  of  T-units  in  388 
samples.  The  essays  were  typed,  and  a  team  of  12  scorers, 
after  undergoing  a  training  session  with  anchor  papers, 
holistically  scored  the  essays  on  a  scale  of  1-4. 

A  correlation  of  holistic  scores  with  the  T-unit 
results  proved  non-significant.  When  the  holistic  scores 
were  analyzed  for  the  four  groups  as  a  whole,  the  holistic 
scores  increased  over  the  semester  and  reflected  a  slight 
decline  on  the  delayed  post-test.  However,  when  the  T-unit 
scores  were  analyzed  for  the  same  period  of  time,  no 
significant  results  occurred. 

Thus,  Shoaf  (1985)  concludes  that  T-unit  analysis  is 
not  effective  in  evaluating  the  overall  writing  progress  of 
groups  and  that  T-unit  scoring  should  only  be  used  for 
determining  levels  of  syntactic  maturity.  Acknowledging 
that  her  study  did  not  address  the  issue  of  individual 
writing  proficiency,  she  states,  "Holistic  scoring  is  a 
useful  technique  for  determining  whether  groups  of  students 
have  made  general  progress  in  the  development  of  writing" 
(p. 67).


In a study by Bauer (1981), the cost-effectiveness
of three different scoring systems (analytic, primary
trait, and holistic) was explored, as well as the
inter-reliability and intra-reliability of each system. Nine
graduate students, none of whom were familiar with the
scoring methods, were divided into groups of three and
trained in one of the methods. The graduate assistants
scored 118 essays previously written for the National
Assessment of Educational Progress. Results indicated that
the analytic scoring method, which contained the most
specific scoring criteria and which required the longest
training time, achieved the strongest inter- and
intra-reliabilities. The holistic scoring method, though
attaining the lowest intra-reliability rate, was the second
strongest of the three methods in terms of
inter-reliability. It also proved to be the most cost-efficient
for scoring large numbers of essays.

Janopoulos  (1987)  explored  the  effectiveness  of 
holistic  scoring  from  still  another  perspective.  He  sought 
to  determine  how  well  holistic  scorers  comprehended 
compositions  written  by  nonnative  speakers  of  English. 
After  receiving  training  in  holistic  scoring,  12  readers 
rated  two  compositions  predetermined  as  representing  higher 
and lower quality. In the first rating, the "naive"
condition, readers were not told they would have to write
a recall protocol after the holistic scoring; in the second
rating, the "focused" condition, readers were told
beforehand that another recall protocol would be required.

The  readers  operating  in  the  naive  condition  were  able 
to  recall  the  higher  text  more  clearly  than  they  did  the 
lower,  thereby  illustrating,  according  to  the  researcher, 
the  role  that  comprehension  can  play  in  raters'  holistic 
judgments.  To  his  puzzlement,  even  though  the  readers 
operating  in  the  focused  condition  recalled  more  overall 
content  than  they  did  in  the  naive  condition,  the  focused 
readers  did  not  recall  more  of  the  higher  level  text  than 
they  did  that  of  the  lower;  rather,  they  recalled  about  the 
same  amount  of  information  in  both  levels. 

Janopoulos  (1987)  attributes  the  lack  of  impact  that 
this  higher  text  seemingly  had  on  the  focused  readers  to  a 
possible  ceiling  effect  and  to  the  small  sample  size.  He 
concludes,  nevertheless,  that  holistic  scoring  is  a  valid 
way  to  assess  non-native  speakers'  writing  proficiency  in 
terms  of  the  comprehension  component.  Despite  the  problems 
Janopoulos  encountered  in  interpreting  the  results,  his 
conclusion  seems  valid  in  that  holistic  scoring  more 
closely  resembles  the  naive  condition  under  which  the 
readers  in  his  study  were  operating  than  it  does  the 
focused  condition. 

An altogether different stance toward holistic scoring
appears in Roberts' dissertation (1982); he compared
individualized writing instruction, an approach he strongly
endorses, to the more traditional classroom method of
teaching composition in two West Virginia colleges.
Students  wrote  pre-  and  post-essays  on  a  topic  developed  by 
researchers  in  another  study,  they  took  the  Daly-Miller 
writing  apprehension  test  at  the  beginning  and  end  of  their 
work,  and  they  answered  three  questions  regarding  their 
view  of  writing.  Then  the  essays  were  holistically  scored 
and  studied  for  T-unit  length;  they  were  also  rated 
according  to  a  forced-choice  method.  An  increase  in  the 
T-unit length for the control group was the only
significant difference that occurred.

Roberts  (1982)  questions  the  effectiveness  of  holistic 
scoring  as  a  means  of  evaluating  the  quality  of  student 
writing.  He  notes  that  one  of  his  four  raters  dropped  out 
of  the  study  altogether,  unwilling  to  rate  "Themes  as 
Products"  (p.  96),  and  he  points  out  that  still  another 
rater  failed  to  achieve  acceptable  reliability.  Roberts 
observes,  "All  of  the  raters  commented  that  the  evaluation 
techniques  required  product-centered  evaluation  based  on  an 
artificial  rubric  that,  while  developed  specifically  for 
the  essay  topics  by  prominent  researchers,  was  inadequate 
for  evaluating  what  the  papers  really  deserved,  based  on 
what  the  raters  perceived  as  the  students'  intentions" 
(p.  96). 

Although  Roberts'  (1982)  disillusionment  with  holistic 
scoring may be warranted, his use of both a scoring rubric
and a topic from a different testing program is troublesome;
as White (1985) suggests, the requirements of each
testing population and program must be taken into consideration
in the development of an holistic guide. Moreover,
Roberts'  study  was  problematic  in  that  he  controlled  only 
for  the  instructional  mode  and  not  for  such  other  variables 
as  teacher  differences  or  course  content.  Although  Roberts 
dismissed  this  lack  of  control  by  stating  that  his  study 
was  primarily  naturalistic  rather  than  experimental,  he  did 
not  provide  the  extensive  descriptions  or  observational 
data often associated with naturalistic studies. Therefore,
despite the limitations which holistic scoring
admittedly  has,  the  problems  in  Roberts'  dissertation 
weaken  the  impact  of  his  criticism  of  this  scoring  method. 

Taken together, the studies by Winters (1978), Shoaf
(1985),  Bauer  (1981),  Janopoulos  (1987),  and  Roberts 
(1982)  illustrate  the  potential,  as  well  as  the 
limitations,  of  holistic  scoring  for  writing  evaluation. 
Their  findings  suggest  that  holistic  scoring  is  more 
meaningful — albeit  somewhat  less  reliable  and  more  time- 
consuming  in  terms  of  training  required — than  is  T-unit 
analysis  as  a  means  of  assessing  overall  writing  quality, 
including  the  writing  of  non-native  speakers  of  English. 
It  is  also  a  cost-efficient  approach  for  the  large-scale 
assessment  of  essays.  At  the  same  time,  as  Winters  (1978) 
points out, holistic scoring cannot reveal specific,
diagnostic information and, hence, the purposes for which
it is used must be clearly defined beforehand.

Factors  Involved  in  the  Evaluation  of  Writing 

A  third  major  component  of  the  literature  review 
encompasses  those  studies  that  explore  the  elements 
involved  in  the  evaluation  of  writing.  The  focus  of  this 
section  is  not  on  holistic  scoring  per  se  but  rather  on 
the  larger  issue  of  writing  quality — and  most  importantly, 
on  those  elements  that  comprise  that  quality.  Thus, 
studies  which  address  writing  in  a  variety  of  contexts 
are  included,  as  are  studies  which  use  assessment  methods 
other  than  holistic  scoring.  The  studies  are  primarily 
categorized  according  to  results  although,  of  necessity, 
some overlapping among the categories occurs.

Content  and  Organization 

The importance of content is emphasized by Diederich
(1974), who, in discussing the factor analysis that was
performed in the earlier study of Diederich, French, and
Carlton (1961), states:

Then  it  became  quite  clear  that  the  largest 
cluster  .  .  .  was  most  influenced  by  the  ideas 
expressed:  their  richness,  soundness,  clarity, 
development,  and  relevance  to  the  topic  and  the 
writer's  purpose.  .  .  .  Hence  we  must  accept  it  as 
a  fact  that  a  high  proportion  of  intelligent, 
educated  adults  do  pay  attention  to  the  quality, 
development,  support,  and  relevance  of  the  ideas 
expressed  in  student  compositions  and  weight  them 
heavily  in  their  judgment  of  the  general  merit  of 
these papers. (p. 7)

Support  for  Diederich's  views  comes  from  two  other 
studies  in  which  content  and  organization  proved  to  be 
significant  determiners  of  writing  quality.  For  example, 
Freedman (1979) undertook to find which essay characteristics
influenced judges most by rewriting four essays
on  each  of  eight  topics  composed  by  college  freshmen.  The 
essays  were  rewritten  to  be  strong  or  weak  in  the  four 
broad  categories  of  content,  organization,  sentence 
structure,  and  mechanics;  then  they  were  typed. 

Unaware  of  the  rewriting  that  had  been  done,  12 
instructors of a college freshman English program holistically
scored the papers and subsequently rated the papers
according  to  their  perceptions  of  the  strength  or  weakness 
of  the  papers  in  each  category.  An  analysis  of  variance 
revealed  that,  as  Diederich  (1974)  had  also  found,  essays 
with  stronger  content  received  higher  scores  than  did  those 
with  weaker  content.  Organization  also  proved  to  be  a 
statistically  significant  factor.  Mechanics  appeared  to  be 
influential as well in those papers with strong organization.
When the perceptions of the evaluators toward the
rewritten  versions  were  examined,  interestingly,  the 
evaluators  did  not  always  agree  with  the  rewriters  as  to 
the  strength  or  weakness  of  the  categories  of  content 
and  organization.  In  fact,  two  readers  were  removed  from 
the  study  because  their  disagreement  was  substantial.  The 
readers  had  better  agreement  for  the  more  concrete 
categories of mechanics and sentence structure.

Freedman  (1979)  acknowledges  as  limitations  of  the 
study  the  breadth  of  the  categories  used  for  rewriting — a 
breadth  which  made  it  impossible  to  know  what  exact 
qualities  judges  might  be  rating — and  the  homogeneity  of 
the  raters.  Like  Diederich  (1974),  she  stresses  the  need 
for  emphasizing  more  in  classroom  teaching  the  development 
and organization of ideas.

To  explore  the  criteria  that  holistic  scorers  use  in 
making  their  evaluations,  Breland  and  Jones  (1984)  compared 
scores  obtained  from  a  regular  scoring  of  the  English 
Composition  Test  (ECT)  with  analyses  made  nine  months  later 
by  20  college  English  professors  on  a  sample  of  806  essays. 
The  samples  contained  equal  numbers  of  papers  written  by 
blacks,  whites,  native  Hispanics  and  nonnative  Hispanic 
speakers  of  English.  In  the  special  scoring,  the 
evaluators  first  scored  the  papers  holistically  and  then 
checked  on  an  evaluation  form  the  strong  and  weak  features 
of  each  essay.  During  a  subsequent  session,  the  readers 
were  also  asked  to  write  on  the  essays  themselves. 

Correlational  procedures  used  to  predict  the  original 
holistic  score  indicated  that  readers  were  most  influenced 
by  the  organization,  support,  and  significant  ideas  in 
a  paper,  with  organization  correlating  the  most  highly  of 
all  discourse  characteristics  with  the  original  English 
Composition  Test  score.  Surface  features,  such  as  essay 
length, neatness, and spelling, contributed significantly
to predicting the ECT score as well, with essay length
correlating  most  strongly  at  .43.  Syntactic  and  lexical 
characteristics  influenced  the  scoring  of  the  nonnative 
Hispanic  speakers  of  English. 

On a questionnaire given prior to the scoring, the
special  scorers  indicated  that  organization,  thesis, 
support, and ideas were significant to them, characteristics
that proved influential in their special scoring.
Differences  were  also  noted  between  experienced  and 
inexperienced  scorers,  with  the  experienced  scorers  tending 
to  score  more  harshly.  This  finding  was  similar  to 
Sweedler-Brown's (1985), as discussed in the previous
section  of  the  review. 

Breland  and  Jones  (1984)  note  that  the  score 
reliability  of  one  writing  sample  rated  by  two  readers  has 
been found to range typically from .38 to .58. They stress
the  need  for  caution  in  interpreting  the  results  of  their 
study.  They  speculate  that  the  special  scorers  may  have 
been  unduly  influenced  by  the  evaluation  form  and  by  the 
targeted  groups  of  students,  and  they  point  out  that  their 
sample contained above-average students. The researchers
observe  that  in  their  study,  length  greatly  influenced 
holistic  scores,  implying  possibly  the  importance  of 
development  in  argumentative  essays;  they  call  attention  to 
the  importance  that  content  and  organization  played  in  the 
holistic scores.

Mechanics,  Sentence  Structure,  Vocabulary 

As can be seen from these studies, content and
organization appear to be influential factors in many
evaluators' writing judgments. At the same time, other
elements,  such  as  mechanics  in  Freedman's  (1979)  study  or 
spelling  and  length  in  Breland  and  Jones's  study  (1984), 
play  a  role  as  well.  The  extent  to  which  these  concrete, 
nonrhetorical  factors  can  influence  writing  evaluations 
comprises the focus of several other studies.

Allen  (1976)  investigated  the  influence  of  mechanical 
and grammatical errors on teachers' content ratings by
preparing  four  versions  of  a  writing  sample  that  contained 
different  numbers  of  errors.  Over  400  secondary  English 
teachers  scored  one  version  apiece.  Although  Allen  noted 
several  issues  that  needed  further  exploring,  the  results 
did  not  support  his  hypothesis  that  teachers'  customary 
concerns  with  mechanical  errors  would  affect  their 
evaluation of rhetorical elements.

Rafoth  and  Rubin  (1984)  sought  to  determine  the 
significance  of  content  and  mechanics  on  college 
instructors' evaluation of writing by rewriting an essay to
contain  stronger  or  weaker  content  and  stronger  or  weaker 
mechanics.  The  researchers  composed  three  new  versions  of 
a  timed  expository  essay  originally  written  by  a  college 
freshman,  adding  spelling  and  punctuation  errors  to  two  of 
the versions and deleting propositions from other versions
in order to alter the quality of content. Of the four
final  versions,  one  was  high  in  content  and  free  of  errors; 
one  was  high  in  content  and  full  of  errors;  another  was  low 
in  content  and  free  of  errors;  and  the  last  was  low  in 
content  and  full  of  errors. 

Eighty  composition  instructors  from  four  state 
universities  voluntarily  accepted  one  of  the  versions 
to  grade.  Some  instructors  were  told  to  pay  special 
attention  to  content  and  ignore  mechanics,  whereas  others 
were  told  to  pay  attention  to  mechanics  instead  of  content. 
Still  others  were  simply  told  to  read  the  paper  according 
to  their  normal  practice.  All  the  instructors  were  also 
asked  to  rate  the  paper  according  to  the  criteria  on  the 
Diederich  scale. 

A  series  of  ANOVAs  showed  that  mechanically  correct 
versions  received  higher  general  impression  scores  than  did 
those papers with errors; furthermore, the ratings according
to the Diederich scale showed that the mechanically
correct  versions  received  higher  scores  for  ideas, 
organization,  and  punctuation  than  did  those  versions  with 
the  errors  inserted.  According  to  the  researchers,  "The 
present  results  strongly  suggest  that  regardless  of  writing 
content  or  evaluative  criteria,  college  instructors' 
perceptions  of  composition  quality  are  most  influenced  by 
mechanics"  (p.  455).  They  speculate  that  graders  may  not 
distinguish  clearly  between  the  domains  of  content  and 
mechanics in making their writing judgments.

The  researchers  acknowledge  that  the  writing  assignment 
was  limited  by  timed  conditions  and  by  the  inclusion  in 
some  versions  of  a  substantial  number  (14)  of  errors. 
However,  an  additional  limitation  seems  to  have  been  the 
use  of  only  one  essay  per  grader  per  evaluative  condition. 
If  each  grader  had  been  given  several  essays  to  score  or  if 
the  graders  had  been  given  some  training  in  using  the 
Diederich  instrument,  the  results  obtained  by  Rafoth  and 
Rubin  (1984)  might  appear  more  conclusive. 

In another study of the way teachers' writing evaluations
are influenced, Stewart and Grobe (1979) reexamined
232  samples  from  an  earlier  national  writing  assessment 
program.  They  found  that  students  increased  in  the  three 
measures  of  syntactic  maturity — words  per  T-unit,  words  per 
clause,  and  clauses  per  T-unit — from  grade  5  to  grade  11. 
In  addition,  students  improved  in  their  command  of  spelling 
and  in  the  avoidance  of  run-on  sentences;  they  did  not 
improve  to  the  same  extent  in  their  avoidance  of  unclear 
pronoun  reference  or  avoidance  of  sentence  fragments. 
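
The three syntactic maturity measures named above are simple ratios of counts. As a minimal sketch, assuming that the words, clauses, and T-units in a writing sample have already been counted by hand, the ratios could be computed as follows. The function name and the counts used here are hypothetical and are not drawn from Stewart and Grobe's data.

```python
def syntactic_maturity(words: int, clauses: int, t_units: int) -> dict:
    """Return the three syntactic maturity ratios for one writing sample.

    The inputs are hand-tallied counts; this sketch does not attempt to
    segment text into clauses or T-units automatically.
    """
    if clauses <= 0 or t_units <= 0:
        raise ValueError("clauses and t_units must be positive")
    return {
        "words_per_t_unit": words / t_units,
        "words_per_clause": words / clauses,
        "clauses_per_t_unit": clauses / t_units,
    }


# Hypothetical counts for a single essay (illustration only).
sample = syntactic_maturity(words=120, clauses=15, t_units=10)
```

With these hypothetical counts, the sample averages 12 words per T-unit, 8 words per clause, and 1.5 clauses per T-unit.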

The  features  which  best  predicted  the  quality  ratings 
in  grades  8  and  11  were  the  number  of  words  and  spelling; 
only  in  grade  5  were  the  syntactic  maturity  measures  at  all 
significant.  While  speculating  that  the  teachers  in  grades 
8  and  11  may  have  been  influenced  more  by  content  and 
organization  than  by  sentence  maturity,  Stewart  and  Grobe 
(1979)  express  dismay  at  the  lack  of  concern  seemingly 
shown for syntactic development.

In  a  subsequent  study  Grobe  (1981)  compared  analytic 
ratings  completed  by  18  trained  graders  to  the  holistic 
scores  assigned  narratives  written  by  437  5th,  8th,  and 
11th  grade  students.  As  in  the  earlier  study,  composition 
length  and  the  absence  of  spelling  errors  proved  to  be 
significant  factors  in  predicting  the  holistic  score. 

To  explain  the  holistic  variance  unaccounted  for  by 
the  14  syntax  and  mechanics  variables,  Grobe  (1981) 
subsequently  added  several  vocabulary  measures  to  the 
analytic  rating  system.  A  computer  program  analyzed  50 
essays  selected  at  random  from  each  grade  level.  Spelling 
continued  to  be  important,  but  essay  length  was  less 
significant  once  vocabulary  variables  were  introduced. 
Instead,  the  vocabulary  variable  which  indicated  the  number 
of  different  words  in  a  composition  became  significant, 
leading  Grobe  to  conclude  that  vocabulary  diversity  is 
important  in  good  narrative  writing. 
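
The vocabulary variable described here, the number of different words in a composition, can be illustrated with a short sketch. The tokenization rule below (lowercased runs of letters and apostrophes) is an assumption made for illustration only; the way in which Grobe's program identified words is not described here, and the function name is hypothetical.

```python
import re


def different_words(text: str) -> int:
    """Count the distinct word forms in a composition.

    Assumption: a "word" is a run of letters or apostrophes, compared
    without regard to capitalization.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return len(set(words))


# "the" appears twice but is counted once.
count = different_words("The snow fell and the wind blew.")
```

Because such a count tends to rise with essay length, it may overlap with the length variable, which is consistent with the finding that length became less significant once the vocabulary measures were introduced.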

The  importance  of  mature  vocabulary  and  complex  syntax 
on  writing  evaluation  comprised  the  focus  of  a  study  by 
Neilsen  and  Piche  (1981),  who  created  four  versions  of  a 
250-word  descriptive  passage  on  a  winter  scene.  One 
passage  contained  complex  nominals  and  mature  vocabulary; 
a  second  contained  complex  nominals  and  simple  vocabulary 
(as  in  the  use  of  the  word  "face"  instead  of  "confront"); 
a third contained simple nominals (as in the phrase "like
cattle in a barren field" instead of "like cattle in a
barren, frozen field of blowing snow"); the fourth contained
simple  nominals  and  simple  vocabulary. 

Eighty  high  school  English  teachers  were  given  folders 
with  one  version  of  the  passage.  They  assigned  holistic 
scores  to  the  essays  and  rated  them  according  to  a  scale 
containing  bipolar  descriptions  of  qualities,  such  as 
"logical  .  .  .  illogical."  Results  of  an  ANOVA  indicated 
that  nominal  complexity  did  not  significantly  affect  either 
the  holistic  scores  or  the  composition  scales;  however, 
vocabulary  did  have  a  significant  impact  on  both. 

The  authors  note  the  following  limitations  of  the 
study:  The  constructed  passage  might  not  resemble  actual 
student  writing,  verbs  comprised  the  only  basis  for  the 
vocabulary  differences,  and  the  findings  they  obtained  from 
descriptive  passages  might  not  apply  to  other  modes. 
Despite  these  limitations,  vocabulary  seems  clearly  to  have 
influenced  the  holistic  scores  for  this  descriptive  essay, 
just  as  it  influenced  the  narrative  writing  in  Grobe's 
(1981)  study. 

Length  and  Surface  Features 

Thus,  the  above  studies  suggest  that  such  factors  as 
mechanics,  spelling,  syntactic  maturity,  and  vocabulary  may 
affect  some  judgments  of  writing  quality.  As  has  been 
seen  in  the  previously  cited  studies  by  Breland  and  Jones 
(1984),  Grobe  (1981),  and  Roberts  (1982),  length  has  also 
been a contributing factor.

Length  proved  similarly  influential  in  a  study 
conducted  by  Nold  and  Freedman  (1977)  to  explore  whether 
certain  elements  could  be  identified  as  contributing  to 
readers'  evaluations  of  compositions.  The  researchers  used 
four  argumentative  essays  written  by  each  of  22  Stanford 
freshmen.  The  essays  were  typed,  and  then  six  experienced 
teachers  were  trained  to  score  the  papers  holistically. 

Nold  and  Freedman  (1977)  hypothesized  that  four  main 
categories  might  prove  influential:  the  extent  to  which 
ideas  were  developed,  the  organization  of  those  ideas,  the 
complexity  of  syntax,  and  the  adequacy  of  vocabulary. 
Emphasizing  countable,  syntactic  elements  within  the  essay, 
Nold  and  Freedman  analyzed  the  essays  according  to  an 
instrument  developed  by  Golub  and  supplemented  by  such 
variables  as  common  verbs  and  the  length  of  the  essay. 

The  researchers  found  that  the  holistic  scores  assigned 
were  distributed  below  the  mean,  with  readers  noting  that 
they  had  expected  better  writing  of  Stanford  freshmen. 
Four  variables,  including  shortness,  overuse  of  modals 
and  be  verbs,  and  common  vocabulary,  negatively  predicted 
quality,  whereas  final  free  modifiers  positively  predicted 
quality  ratings.  According  to  the  researchers,  limitations 
included  their  focus  on  only  those  measurable  elements 
of  writing  quality  and  their  use  of  a  select  group  of 
students  as  a  sample.  Despite  any  potential  problems  with 
the study, length, in addition to vocabulary, appears as a
contributing factor for the writing judgments made in this
study in much the same way that it appeared in the research
previously discussed by Grobe (1981) and by Neilsen and
Piche (1981). Length is frequently considered a surface
feature and, as White (1985) notes, is criticized whenever
it is used as the basis for holistic judgments. However,
as Freedman and Calfee (1983) thoughtfully observe, length
cannot always be identified as a superficial quality of a
paper. They note:

The  problem  with  interpreting  such  findings  is 
that  length  may  or  may  not  be  an  index  of  a 
significant  psycholinguistic  category  such  as  idea 
development. Longer essays with fuller development
of ideas may deserve higher scores than
shorter essays, but longer essays padded with
redundant information may deserve lower scores
than their shorter counterparts. Correlational
studies do not reveal why longer essays receive
higher scores. (p. 85)

Even though the length of a paper may not always
create a problem for writing evaluation, other surface
features such as handwriting and neatness are clearly
troublesome. Studies completed over 15 years ago (Chase,
1968; McColly, 1970; Marshall, 1972) have suggested
that poor handwriting or messy essays may affect the
grades assigned to them.

For example, Chase (1968) found that 16 graduate
students gave "more generous" grades to essay test items
done with good handwriting than they did to items done
with poor handwriting; although the scorers tended to
score papers equally on the first item, the negative "halo
effect" of poor handwriting adversely influenced the
scoring of the second item.

Marshall  (1972)  introduced  various  numbers  of  spelling 
errors  into  essays  composed  in  response  to  one  American 
history  question.  For  each  essay  containing  a  set  number 
of  spelling  errors  (e.g.,  0,  6,  12,  and  18  errors), 
Marshall  prepared  a  typewritten  copy  and  had  students  copy 
the  essay  over  with  three  different  degrees  of  neatness  and 
legibility.  The  16  resulting  forms  of  the  essays  were  sent 
to 480 classroom teachers who were asked to grade the
papers  according  to  content.  Although  Marshall,  to  his 
surprise,  found  no  significant  differences  in  mean  scores 
for the levels of spelling problems, he did find differences
in the scores assigned to typed versus handwritten
essays.  That  is,  all  the  handwritten  versions  of  essays 
containing  zero  to  six  errors  received  lower  scores  than 
did  the  typed  versions  of  the  same  essays;  the  results  for 
essays containing 12 to 18 errors were less clear-cut and
seemed  to  fall  into  a  random  pattern. 

Handwriting  is  also  labeled  as  a  problem  in  McColly's 
review (1970) of the issues comprised in writing evaluation.
Citing several studies in which handwriting
influenced  writing  judgments,  McColly  warns  that  in  such 
instances,  "the  validity  is  actually  lowered,  because 
handwriting  ability  and  writing  ability  are  not  the  same 
thing." He continues by suggesting that "the only cure
for this condition is to have examination essays typed
or  put  into  some  other  standard  printed  format"  (p.  154). 

Stach (1987), too, expresses concern in his dissertation
about the influence of appearance on holistic scorers'
judgments.  In  his  study  three  college  teachers  were 
trained  to  score  holistically  140  essays  written  by  college 
freshmen.  The  teachers  then  described  what  they  considered 
good  writing  to  be,  and  they  rank  ordered  the  importance 
they  placed  on  several  factors  in  making  writing 
evaluations.  Presumably,  they  considered  the  factor  of 
"presentation,"  which  signified  handwriting  and  neatness, 
to  be  totally  unimportant  and  mechanics  to  be  less 
meaningful  than  many  other  qualities;  however,  a  regression 
analysis  revealed  that  appearance  and  mechanics  were  the 
only  statistically  significant  predictors  of  holistic 
scores.  According  to  Stach,  the  implication  of  such 
findings  was  "that  scorers  in  holistic  procedures  (and 
perhaps  teachers  in  general)  aspire  to  grade  essays 
differently  than  they  actually  do,  and  that  they  hope 
to be qualitatively better graders than they are, overlooking,
or 'seeing beyond,' mechanics and appearance"
(p.  113).  Suggesting  that  the  scorers'  descriptive 
statements  reflected  not  "priorities,  but  aspirations," 
Stach  concludes,  "Certainly  there  is  a  great  gulf  between 
what  they  say  matters  to  them  and  what  the  best  statistical 
predictors of holistic scores turned out to be" (p. 120).

As  these  studies  indicate,  writing  evaluations  are 
affected  to  varying  degrees  by  such  elements  as  mechanics, 
vocabulary,  syntax,  spelling,  length,  and  even  handwriting 
or neatness. Such a link between these elements of form
and  the  rhetorical  elements  of  organization  and  content  is, 
according  to  Harris  (1977),  almost  inevitable.  Referring  to 
her  own  study,  which  will  be  discussed  in  the  next  section, 
Harris  comments  that  there  "came  the  conviction  that  form 
is  so  integral  a  part  of  content  that  in  some  ethereal  way 
form  is  content  and  content  is  form"  (pp.  180-181). 

Other  Factors  Involved  in  Writing  Judgments 

In  addition  to  elements  of  form  and  content,  other — 
almost  intangible — factors  in  writing  evaluation  have 
received increasing attention. One factor is the discrepancy
between what readers say they value and what they
actually  reward;  the  second  factor  is  the  perspective  that 
readers  adopt  toward  the  writers  behind  the  essays. 

Harris  (1977)  sought  to  determine  those  features  that 
influenced  English  teachers  in  their  evaluation  of  student 
writing.  Thirty-six  high  school  teachers  read  12  student 
essays,  marking  them  according  to  their  customary  practice; 
they  then  ranked  the  essays  according  to  merit  and 
completed  a  questionnaire.  They  finally  reevaluated  the 
papers  against  five  criteria. 

Taken together, the four procedures revealed a discrepancy
between the criteria teachers rated as important
and the criteria they actually demonstrated in their
comments  and  markings.  That  is,  on  the  questionnaire  the 
teachers  indicated  that  content  and  organization  were  of 
great  importance  to  them,  while  their  annotations  and 
manner  of  ranking  the  papers  revealed  the  major  role  that 
mechanics  and  usage  played.  For  these  teachers,  sentence 
structure  and  diction  were  less  important.  Additional 
findings  by  Harris  (1977)  included  her  discoveries  that 
the  teachers  basically  agreed  with  each  other  about  the 
evaluation  of  writing  and  that  many  of  the  teachers' 
annotations  and  other  comments  were  negative. 

Hake  and  Williams  (1981)  raise  the  question  of  what 
teachers  of  writing  actually  do  value:  "Is  it  possible 
that  despite  our  public  declarations  about  clear,  direct 
writing,  we  might  somehow  discourage  our  students  from 
writing  good  prose  and  encourage  them,  through  our  own 
tacit behavior, to write bad?" (p. 434). Their question
arises  from  four  experiments  they  conducted  in  which  they 
altered  the  style  of  similar  essays — changing  the  direct, 
verbal  style  that  contained  a  subject/verb/object  (or 
agent/action/goal)  to  a  nominalized,  indirect  style  in 
which  abstract  nouns  predominated.  Approximately  80 
teachers,  from  high  school  to  the  upper  college  classes, 
rated  the  heavily  nominalized  papers  more  highly  than 
they  did  those  essays  which,  though  structurally  similar, 
were directly verbal in style. In one experiment, the
readers, who were unaware of the purpose of the study,
wrote  comments  indicating  that  the  nominalized  versions 
contained  better  organization  and  support,  even  though 
the pairs of papers were identical in those respects. In
another  experiment,  senior  college  graders  rated  the 
nominalized  papers  higher  than  they  did  verbal  versions 
even  when  they  could  find  major  errors  in  the  nominalized 
version. 

The  authors  speculate  that  the  good  nominalized  papers 
may  have  been  associated  with  intellectual  quality,  in 
contrast  to  the  perceived  lower  quality  of  the  verbal 
versions.  Thus,  Hake  and  Williams  (1981)  suggest  that 
despite  what  writing  teachers  claim  to  do,  one  cause  of 
"stylistic  infelicity"  (p.  446)  may  be  the  practices  of 
the teachers themselves.

Still  another  source  of  complexity  in  writing 
evaluation  is  the  attitude  or  expectations  of  the  readers 
toward  the  writers  of  the  essays.  For  example,  Freedman 
(1984)  gave  to  four  experienced  holistic  scorers  packets  of 
essays  containing  not  only  the  writings  of  students  from 
four  different  colleges  but  also  a  timed  essay  composed  by 
a  professional  writer  on  the  same  topic.  The  scorers,  who 
were  unaware  that  professional  writings  had  been  included 
in  the  study,  gave  only  slightly  higher  mean  holistic 
scores  to  the  professionals  than  they  did  to  the  student 
writers.  (In  fact,  student  writers  received  the  three 
highest holistic scores.)

The  professional  writers  received  higher  analytic 
scores  than  did  the  students  in  the  categories  of  voice, 
sentence  structure,  word  choice,  and  usage,  but  they 
received  lower  scores  in  the  categories  of  development 
and  organization.  Because  of  the  low  scores  that  had  been 
given  in  these  categories  and  because  of  wide  differences 
in  the  holistic  scores  assigned  to  these  papers,  Freedman 
(1984) sought to discover what qualities characterized the
professional  essays.  She  found  four  common  traits:  (a)  a 
tone  of  familiarity,  (b)  an  initial  rejection  of  the  task 
with  a  subsequent  acceptance  of  it,  (c)  a  final  commitment 
to  the  topic  with  resulting  forcefulness  in  the  papers,  and 
(d)  scholarly  references. 

As  these  traits  are  unlikely  to  appear  in  most 
students'  writing,  the  author  speculates  that  the  scorers 
may  have  negatively  reacted  to  what  they  viewed  as 
"overstepping"  of  authority  on  the  part  of  some  students. 
She  advocates  that  teachers  encourage  students  to  write 
with  authority  and  freedom. 

Sullivan  (1986)  also  explored  whether  holistic  scorers' 
disagreement  in  problem  papers  about  discourse  issues 
reflected  certain  attitudes  toward  the  writer  behind  the 
essays.  Stressing  that  "evaluation  of  writing  ability  is 
best  viewed  as  a  multifunctional  social  interaction" 
(p.  11),  Sullivan  sought  to  determine  whether  readers 
created writers in addition to the meaning of the texts.

Sullivan  (1986)  randomly  selected  for  analysis  99 
essays  that  had  been  written  by  entering  freshmen  and 
holistically  scored.  Topics  for  the  essays  had  required 
students  to  argue  to  a  specified  audience  a  certain 
position  on  a  controversial  issue.  Using  Prince's  "Taxonomy 
of Assumed Familiarity," Sullivan classified the information
contained in the noun phrases of the essays in terms
of  assumptions  made  about  readers'  familiarity  with  the 
information — that  is,  whether  it  was  assumed  to  fall 
under  new,  inferable,  or  old  (Evoked)  categories  (p.  14). 
A regression analysis indicated that three of the subcategories
of information significantly correlated with
holistic  scores.  According  to  Sullivan,  these  categories 
represent  deviations  from  Grice's  Cooperative  Principle 
in  which  writers  are  supposed  to  assume  that  the  readers 
have  reasonable  familiarity  with  the  information.  He 
speculates  that  these  deviations  from  expected  norms 
reflect three different identities, that of the "test-taker,"
the "knowledgeable student," and the "straightforwardly
cooperative writer" (p. 33), and that readers
were  responding  either  negatively  or  positively  to  these 
identities.  He  stresses  the  need  for  additional  research 
to  determine  whether  readers  are  evaluating  texts  on  the 
basis  of  their  responses  to  the  writers'  identities. 

Though  intriguing,  much  of  Sullivan's  (1986)  work 
appears highly speculative; for example, the basis behind
his claims that certain linguistic categories of information
reflect particular social identities, such as that of
the  "test  taker,"  seems  arbitrary.  Moreover,  as  Sullivan 
himself  acknowledges,  the  hypothetical  audience  that 
students  were  required  by  the  topics  to  address  and  that 
conflicted  with  the  real  audience  of  holistic  scorers  may 
have compounded students' uncertainties about how much
information  they  needed  to  provide.  But  despite  the 
problems  that  Sullivan's  work  contains,  his  research 
illustrates the potential impact that the writers themselves
may have on the readers' evaluation of their work.

Barritt, Stock, and Clark (1986) found a similar
attitude  of  unease  held  by  readers  toward  writers  who  do 
not  adhere  to  their  expected  role.  A  group  of  faculty 
members  of  the  University  of  Michigan's  English  Composition 
Board  met  periodically  over  a  two-year  period  to  discuss 
how  they  holistically  rated  student  placement  essays  and 
why  they  sometimes  disagreed  with  each  other.  At  each 
meeting,  they  read  a  selected  essay,  scored  it  privately, 
and  noted  their  reasons  for  the  score;  then  they  discussed 
their  findings  together. 

They  found  that  on  those  essays  which  evoked  the  most 
disagreement,  the  comments  fell  into  several  categories: 
(a)  "the  written  text";  (b)  "the  imagined  student  writer"; 
and  (c)  the  "prospective  student"  (p.  319).  The  authors 
note that even though they initially urged readers to pay
attention to the texts, rather than to the writers behind
the work, their recommendations were in vain. They justify
the readers' reactions by emphasizing the importance of the
expectations the readers bring to the reading:

We  had  forgotten  that  reading  is  always  an  act  of 
recreation  and  that  what  we  have  learned  as 
students  of  literary  theory  has  much  to  teach  us 
about what we do as we read our students' assessment
essays. In our case, as reader/evaluators
asked  to  judge  placement  essays,  we  had  to  engage 
ourselves  as  active  readers  trying  to  make  common 
sense — that  is  sense  in  common — with  student 
authors.  We  found  ourselves  working  mentally  with 
each  student  writer  to  compose  a  placement  essay; 
as  we  overlaid  the  student's  writing  with  our  own 
expectations,  we  completed  incomplete  arguments, 
supplied  missing  transitions,  second-guessed 
particular cases for general statements.

Like  the  readers  Wolfgang  Iser  posits,  we  were 
trying to build consistency into students' texts
by  investing  spaces  of  indeterminacy  in  them  with 
our  own  expectations  about  what  should  fill  the 
gaps. . . . The teaching experience each of us
brought  to  the  task  of  evaluating  student  texts 
led  us  to  expect  in  each  text  the  writing  of  a 
'typical'  college  freshman,  and  our  expectations 
influenced  our  readings.   (p.  320) 

Arguing  against  the  need  always  to  have  consistency  of 
judgment, Barritt et al. (1986) suggest that it is more
important  to  accept  and  understand  the  basis  behind  those 
judgments  that  are  not  in  agreement. 

In  a  similar  vein,  Martin  (1987)  explored  the  process 
that  occurs  for  readers  in  a  placement  scoring.  She 
examined  the  written  responses  that  three  faculty  members 
made  to  six  placement  essays  composed  by  entering  college 
students. The instructors, all of whom were experienced
scorers, ranked the papers once and wrote comments intended
for  the  students  to  use  in  revising;  at  a  later  time,  they 
ranked  the  papers  again  and  wrote  comments  intended  for 
the  researcher.  Martin  studied  the  comments  in  the  light 
of the readers' own backgrounds and their own experiences
with  reading  and  writing.  She  concludes  by  observing  that 
readers,  as  well  as  writers,  are  individuals  and  that  the 
essays  do  not  necessarily  contain  features  which  all 
readers  can  assess;  rather,  in  her  view,  placement  scorers 
are  primarily  concerned  with  the  extent  to  which  the 
writing samples indicate students' readiness for college
tasks.

Summary  of  Literature  Review 

Together,  the  three  major  sections  of  the  literature 
review  reveal  complex  links  between  holistic  scoring  and 
the  writing  criteria  on  which  it  is  based.  The  studies  of 
the  first  section  (Diederich,  French,  and  Carlton,  1961; 
Godshalk,  Swineford,  and  Coffman,  1966;  Freedman,  1981; 
Hrach,  1983;  Sweedler-Brown,  1985;  and  Huot,  1988) 
illustrate  both  the  development  of  and  rationale  for 
various  holistic  procedures,  including  the  training  of 
readers.  The  studies  of  the  second  section  (Winters,  1978; 
Roberts,  1982;  Shoaf,  1985;  and  Janopoulos,  1987)  depict 
the  strengths  and  weaknesses  of  holistic  scoring  in 
comparison to other evaluation systems. The studies of the
last section reflect researchers' attempts to explore — in
a  variety  of  contexts — the  elements  involved  in  the 
evaluation  of  writing.  From  this  section  emerges  a  picture 
of readers in some contexts primarily influenced by organization
and content (Diederich, 1974; Freedman, 1979; and
Breland  and  Jones,  1984)  and  of  readers  in  other  contexts 
chiefly  concerned  with  such  features  as  mechanics, 
spelling,  vocabulary,  and  length  (Harris,  1977;  Nold  and 
Freedman,  1977;  Grobe,  1981;  Rafoth  and  Rubin,  1984;  and 
Stach,  1987).  Still  other  studies  convey  how  readers' 
perceptions and expectations of writers affect some evaluations
(Freedman, 1984; Sullivan, 1986; and Barritt, Stock,
and  Clark,  1986).  The  involvement  of  so  many  factors  in 
writing  judgments  underscores  not  only  the  need  in  writing 
assessment  for  such  structured  approaches  as  holistic 
scoring  but  also  the  need  for  training  and  monitoring  to 
ensure  some  similarity  in  the  perspectives  that  readers 
bring to their evaluations. But if the need for training
is  clear,  the  nature  of  that  training  and  monitoring  in 
holistic  scoring  has  not  yet  been  fully  explored. 

Holistic  scoring  of  assessment  essays  entails  special 
circumstances  both  for  writers  and  for  readers:  That  is, 
just  as  writers  in  an  assessment  often  have  a  limited 
time  in  which  to  discuss  a  given  topic  for  an  unfamiliar 
audience,  so,  too,  do  holistic  readers  have  a  short  time  in 
which to determine the meaning of a text and respond by
evaluating it. Questions thus remain as to how the
training  and  monitoring  of  a  structured  holistic  scoring 
help  to  create  a  community  of  readers  who  willingly 
accommodate  their  own  writing  criteria  to  the  writing 
standards  of  the  group  as  a  whole. 

The  methodology  used  in  the  present  study  to  explore 
the  impact  of  monitoring  on  an  holistic  scoring  is 
described in Chapter 3.


CHAPTER  3 
METHODOLOGY 


The  impact  that  training  and  monitoring  have  on  an 
holistic  scoring  of  writing  was  explored  from  three 
perspectives: (a) A monitored holistic scoring was conducted
in which 12 readers scored over 100 student-written
essays; 8 of the readers recounted their responses to these
essays through the use of logs, and 4 of the readers
recorded their reactions through the use of audio-taped
protocols; (b) the same 12 readers also scored an equal
number  of  essays  at  home  in  an  unmonitored  situation,  again 
using  logs  and  audiotapes;  and  (c)  all  participants  in  the 
study — including  the  3  table  leaders  and  2  chief  readers — 
were  administered  a  questionnaire  regarding  their  attitudes 
to  writing  evaluation  and  to  the  holistic  scoring  process. 

The  Monitored  Holistic  Scoring 

A  monitored  holistic  scoring  was  conducted  to  replicate 
on  a  small  scale  the  structured  scorings  used  in  the 
writing  assessment  of  college  sophomores  throughout  the 
state  of  Florida.  The  chief  reader  for  the  state  of 
Florida,  together  with  an  associate  chief  reader,  conducted 
the  scoring  on  Saturday,  January  7,  1989.  Permission  was 
obtained from the Department of Education in Tallahassee
to select by means of a stratified random sampling over 100
essays used two years previously in an administration of
the College Level Academic Skills Test (CLAST).

Subjects  in  the  Study 

Seventeen  men  and  women  who  are  highly  experienced 
holistic  scorers  and  who  have  taught  English  at  different 
levels  were  the  subjects.  Earlier  studies  by  Follman  and 
Anderson  (1967)  and  by  Freedman  (1979)  had  found  the 
homogeneity  of  their  scorers  to  be  a  factor  in  the  results; 
in  fact,  Freedman  cites  such  homogeneity  as  a  limitation  of 
her  work.  Because  most  CLAST  scorings  employ  English 
instructors  from  diverse  levels,  it  was  assumed  that 
subjects  with  a  broad  base  of  English  teaching  would  more 
accurately  reflect  real-life  conditions  found  in  an 
holistic  scoring  session. 

The  chief  reader  for  the  study,  a  former  director  of 
freshman  composition  at  a  large  university,  is  the  current 
chief  reader  for  the  state  of  Florida  and  has  directed 
many  large-scale  holistic  scorings.  The  assistant  chief 
reader,  a  former  chair  of  a  high  school  English  department 
and  an  Advanced  Placement  English  teacher,  has  frequently 
served  in  the  role  of  assistant  chief  reader. 

Eight  women  and  seven  men  participated  in  this  study 
either as table leaders or as readers. Five taught in
three local high schools, with some instructing Advanced
Placement English classes or participating in the Writing
Enhancement Program; several had studied in the Florida
Writing  Project.  Another  five  were  English  faculty  from 
three  community  colleges  within  a  200-mile  radius.  The 
remaining  five  were  from  three  universities  or  four-year 
colleges  within  a  100-mile  radius.  The  teaching  experience 
of  the  15  participants,  in  addition  to  the  2  chief  readers 
described  above,  ranged  from  8  to  37  years,  with  an  average 
of  18  years;  their  holistic  scoring  experience  averaged 
7  years.  All  but  one  participant  had  holistically  scored 
other  types  of  examinations  as  well  as  the  CLAST.  The 
subjects  received  an  honorarium  for  their  participation 
in  the  study. 

Writing  Samples 

The  essays  used  in  the  study  were  written  by  college 
students  nearing  the  end  of  their  sophomore  year  as  part 
of a state-wide mandatory test to assure minimal competencies
in reading, writing, and mathematics. Students
were  given  a  choice  of  two  topics,  each  of  which  required 
them  to  draw  upon  their  general  knowledge,  to  create  a 
thesis,  and  to  support  it  during  the  50  minutes  allotted 
for  writing.  Because  of  test  security  purposes,  the  topics 
cannot  be  revealed.  However,  they  followed  the  paradigm 
developed  by  Hoetker  and  Brossell  (1986)  and  used  in 
Florida  for  several  years;  the  paradigm  typically  is 
a fragment, containing a class specification and two
differentiating criteria. The paradigm is exemplified by
such  topic  phrases  as  "a  book/  that  many  students  read/ 
that  may  affect  them  beneficially"  or  "a  common  practice/ 
in  American  colleges/  that  should  be  changed"  (p.  330) 
which  Hoetker  and  Brossell  describe  in  their  research. 

Procedure 

As  students  taking  CLAST  have  a  choice  of  two  topics, 
holistic  scorers  are  accustomed  to  scoring  sets  of  papers 
in which two different topics are intermingled; consequently,
the essays used in the study were not separated
out  by  topic  for  scoring. 

As  shown  in  Table  1,  a  stratified  random  sampling 
procedure  was  used  to  select  112  essays  which  would 
approximate  the  distribution  of  scores  obtained  in  the 
actual  scoring  of  these  essays;  hence,  the  essays  reflected 
the  writing  of  students  from  various  institutions  in 
different  parts  of  the  state.  On  the  basis  of  scores 
originally  assigned  to  the  papers,  the  papers  were  randomly 
divided  in  half  for  the  monitored  and  unmonitored  scorings. 
Thus, the papers with scores of 8 were distributed equally
to the two treatments, as were the papers with scores of
2, 4, 5, 6, and 7. To ensure students' anonymity, all
identifying  information  was  removed,  and  each  essay  was 
labeled  with  a  three-digit  number;  the  essays  were  then 
reproduced so that each reader would have a copy of all
56 papers. The papers were randomly distributed in 4
packets of approximately 14 to each reader.

TABLE 1
Descriptive Statistics for the Stratified Random Sampling

Total                            Cumulative   Cumulative
Score    Frequency    Percent    Frequency    Percent

Frequency of Scores for the Actual Essays

  2        1779         9.9         1779          9.9
  4        4666        26.0         6445         36.0
  5        4535        25.3        10980         61.3
  6        4098        22.9        15078         84.1
  7        2172        12.1        17250         96.2
  8         675         3.8        17925        100.0

Frequency of Scores for the Sample Essays

  2          11         9.8           11          9.8
  4          29        25.9           40         35.7
  5          28        25.0           68         60.7
  6          26        23.2           94         83.9
  7          14        12.5          108         96.4
  8           4         3.6          112        100.0
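
The proportional allocation behind Table 1 can be illustrated with a
brief sketch. It assumes simple rounding of each stratum's share of a
112-essay sample and a seeded random split of a stratum between the two
treatments. The function names and essay identifiers are hypothetical
and are not taken from the study's own procedures.

```python
import random

# Frequencies of total scores in the actual CLAST administration (Table 1).
ACTUAL_FREQUENCIES = {2: 1779, 4: 4666, 5: 4535, 6: 4098, 7: 2172, 8: 675}


def proportional_allocation(frequencies, sample_size):
    """Return the number of essays to draw at each score level."""
    total = sum(frequencies.values())
    return {score: round(count * sample_size / total)
            for score, count in frequencies.items()}


def split_stratum(essay_ids, seed=0):
    """Randomly divide one stratum into two halves (hypothetical helper)."""
    shuffled = list(essay_ids)
    random.Random(seed).shuffle(shuffled)
    middle = len(shuffled) // 2
    return shuffled[:middle], shuffled[middle:]


allocation = proportional_allocation(ACTUAL_FREQUENCIES, 112)
# Hypothetical three-digit identifiers for the four essays scored 8.
monitored, unmonitored = split_stratum(range(800, 804))
```

With the Table 1 frequencies, this rounding yields 11, 29, 28, 26, 14,
and 4 essays at score levels 2, 4, 5, 6, 7, and 8, which matches the
sample panel of the table.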

Three  tables  were  established  for  the  scoring,  each 
consisting  of  four  readers  and  an  experienced  table  leader, 
all  of  whom  were  randomly  assigned  to  their  table.  Two 
tables  of  four  readers  each  followed  regular  holistic 
scoring procedures and were used for comparative statistical
purposes in the study; the third table was treated
as  a  separate  entity.  That  is,  the  four  readers  at  the 
third  table  took  part  in  the  training  procedures  but  then 
adjourned  to  small,  adjacent  offices  to  tape  record  both 
their reading of the essays and their reactions to these
essays.

Each  of  the  eight  readers  participating  in  the  regular 
scoring  was  assigned  56  papers  to  score,  a  number 
arbitrarily  chosen  for  several  reasons.  It  was  manageable 
enough  to  facilitate  the  subsequent  interpretation  of  data 
and  yet  affordable.  It  also  represented  a  large  enough 
sample  to  reveal  any  scoring  tendencies  on  the  readers' 
part  and  to  indicate  any  potential  influence  of  training 
samples,  breaks,  and  monitoring  procedures.  However,  prior 
to  any  statistical  analysis,  five  sets  of  data  were 
subsequently  removed  from  the  study:  One  set  of  matched 
papers  was  deleted  because  poor  photocopying  had  made  one 
of  the  two  essays  impossible  to  read;  four  sets  were 
removed as deviant data when the chief readers' independent
scoring beforehand of the entire group of papers revealed
that  some  papers  had  been  incorrectly  scored  two  years 
earlier  and  hence  were  inappropriately  matched.  Thus,  the 
actual  data  for  the  study  comprised  51  sets  of  matched 
papers,  or  102  essays  altogether. 

The  four  readers  at  the  third  table,  hereafter  referred 
to  as  "special  readers,"  were  given  a  subset  of  20  papers 
to score by means of talking protocols. Because they
scored  far  fewer  essays  than  did  the  eight  regular  readers, 
the  special  readers  were  not  included  in  any  statistical 
analysis.  Results  of  the  special  readers'  scoring  will  be 
discussed separately from results obtained from the regular
readers.

Procedures  for  Training 

The  scoring  adhered  to  the  customary  procedures,  with 
rangefinders  provided  initially  for  training  purposes, 
followed  by  the  presentation  of  several  samples  throughout 
the  scoring  for  the  group  to  score  and  tally  together. 
Reading breaks occurred at approximately 45-minute intervals,
and table leaders monitored the scoring throughout.
The chief readers conducted two check readings as an additional
verification that all participants were scoring the
papers  comparably.  In  these  respects,  then,  the  scoring 
represented  a  replication  of  the  procedures  typically  used 
in  assessing  writing  on  a  large  scale. 




The  training  and  monitoring  procedures  also  followed 
custom  insofar  as  readers  were  urged  to  employ  the  full 
range of scores (e.g., from 1-4) in assigning scores to
the  six  rangefinders.  Rangefinders  from  the  original 
reading  were  read  first,  with  the  readers  asked  to  rank 
order  the  papers  and  to  assign  each  of  the  four  scores  to 
at  least  one  essay.  Then  the  readers'  scores  were  publicly 
tallied.  If  one  or  two  scores  clearly  differed  from  the 
scores  assigned  by  other  readers,  readers  whose  scores  were 
discrepant  were  urged  to  look  the  paper  over  again. 

Once  the  rangefinders  were  tallied,  table  leaders, 
who  kept  running  accounts  of  the  vote  at  their  tables, 
led  their  table  in  a  brief  discussion  of  why  papers 
received certain scores. They referred to the operational
definitions  if  necessary.  Then  pairs  of  sample  essays  were 
introduced,  with  readers  again  asked  to  read  and  score  an 
essay  and  raise  their  hands  as  each  score  level  was 
announced  by  the  chief  reader.  Samples  were  given  until 
the group reached a consensus on most scores.

Special  Measures  Used  for  the  Study 

For  the  purpose  of  this  study,  several  new  measures 
were  introduced.  All  eight  readers  at  the  two  regular 
tables,  Table  1  and  Table  2,  were  reading  the  same  papers 
arranged  in  random  order  in  4  packets  of  approximately  14 
papers each. (The third table will be discussed subsequently.)
Thus, for each essay, at least eight scores were
obtained,  four  from  each  of  two  tables. 




As  the  readers  scored  each  paper,  they  were  asked  to 
jot  down  several  brief  comments  about  their  overriding 
impression  of  the  paper,  its  key  strengths  and  weaknesses. 
The  commentary,  therefore,  provided  a  running  log  that  was 
used  to  explore  the  basis  on  which  the  readers  made 
particular  scoring  judgments.  This  "process  log"  resembled 
the  log  developed  for  writers  by  Faigley,  Cherry,  Jolliffe, 
and  Skinner  (1985).  In  addition,  readers  noted  such 
procedures  as  the  time  they  began  each  reading  session 
after  a  break,  their  scores  on  samples,  and  any  adjustments 
they  made  after  talking  to  their  table  leaders  or  after 
consulting the rangefinders. During the previous month
readers  had  been  given  instructions  in  how  to  use  the 
logs  before  they  began  their  unmonitored  scoring.  A  copy 
of  the  log  is  provided  in  Appendix  A. 

In  actual  scorings,  chief  readers  customarily  keep  logs 
as  part  of  their  procedures,  noting  such  details  as  the 
time  of  each  reading,  the  samples  used  for  training,  and 
the  start  of  check  readings.  For  the  purpose  of  this  study, 
chief  readers  were  asked  to  maintain  their  customary  log, 
but it was labeled an "Account of Procedures" in order to
avoid  being  confused  with  the  log  or  running  commentary 
employed  by  the  readers.  The  actual  account  is  included 
in  Appendix  B. 

In  including  the  written  observations  of  readers,  this 
study  partially  followed  the  procedures  used  by  Diederich 
et al. (1961), who asked their readers to note comments as
they sorted papers into piles and assigned rankings.
However,  as  noted  in  the  literature  review,  the  readers 
in  Diederich's  study  came  from  diverse  backgrounds, 
received  no  training,  and  worked  in  an  unstructured 
situation. Logs have also been used in one study undertaken
by Murphy, Carroll, Kinzer, and Robyns (1982) with
the  Bay  Area  Writing  Project  (pp.  397-410). 

The  use  of  written  comments  was  selected  for  this  study 
as  opposed  to  the  annotations  used  by  Breland  and  Jones 
(1984)  in  their  study  of  writing  perceptions.  Despite  the 
difficulties  entailed  in  categorizing  written  observations, 
such  comments  are  far  less  apt  to  disrupt  the  momentum  of 
the  holistic  scoring  than  the  more  analytic  checklist  that 
Breland  and  other  researchers  have  employed.  Moreover, 
unlike  the  analytic  checklists  which  provide  readers  with 
lists  of  certain  criteria,  blank  log  sheets  are  not  apt 
to  influence  readers'  responses.  Indeed,  written  comments 
are  currently  used  during  the  sample  selection  part  of 
actual  holistic  scoring  procedures,  when  the  chief  readers 
assemble  to  select  the  papers  to  be  used  as  training 
samples  during  a  scoring. 

Table  leaders  were  also  asked  to  keep  logs  and  to 
note  such  monitoring  procedures  as  whose  papers  needed 
rereading,  what  discussions  about  writing  ensued,  and 
whether  many  scores  needed  to  be  altered.  In  addition, 
they  described  their  readers'  performance  during,  and 
reaction  to,  the  use  of  training  samples. 




As  noted  previously,  the  third  or  special  table, 
consisting  of  a  table  leader  and  four  experienced  readers, 
also  participated  with  the  other  tables  in  the  use  of 
rangefinders  and  sample  essays.  However,  at  the  conclusion 
of  the  training  papers,  the  readers  adjourned  to  separate 
small  offices  to  record  on  audiotapes  their  ongoing 
reactions  to  a  subset  of  the  papers  used  in  the  monitored 
scoring.  Such  protocol  analysis  has  been  used  in 
composition  research  for  a  number  of  years  and  was  recently 
employed  by  Huot  (1988)  in  his  study  of  holistic  scorers. 
The  subset  of  20  papers,  like  the  larger  one  of  50+  essays, 
was  deliberately  selected  to  contain  a  range  of  score 
levels  and  was  assigned  in  random  order  to  each  of  the  four 
special  readers.  The  number  20  was  arbitrarily  chosen  to 
allow  for  the  extra  time  readers  might  need  to  read  each 
paper  aloud  and  record  their  impressions  and  observations. 
The  table  leader  for  the  special  group  moved  among  the  four 
offices to monitor the scoring and to discuss any discrepancies.
Through these talking protocols some in-depth
insights  were  provided  as  to  how  the  monitored  scoring 
appeared  to  influence  the  scores  that  readers  assigned. 

The  Questionnaire 

A  questionnaire  devised  for  this  study  was  given  to 
all  the  participants  immediately  following  the  completion 
of  the  monitored  holistic  scoring  in  order  that  the 
respondents' written logs or protocols during the scoring
not be influenced by the nature of the questions. The
questionnaire,  a  copy  of  which  is  included  in  Appendix  A, 
contained  three  main  categories  of  questions — (a)  readers' 
ratings  of  the  importance  of  certain  features  in  writing, 
(b) their self-report of their own biases in readings and
their  methods  for  dealing  with  these  biases,  and  (c)  their 
reactions  to  the  structured  setting  of  an  holistic  scoring. 
Most  items  required  closed  responses,  although  several 
allowed  for  open-ended  responses.  An  additional  section 
enabled  table  leaders  and  chief  readers  to  address 
questions dealing with their roles as monitors.

The  questionnaire,  designed  in  accordance  with  the 
principles  set  forth  by  Berdie  and  Anderson  (1974),  was 
pilot  tested  two  months  previously  by  holistic  scorers 
of  the  CLAST  at  another  scoring  site  in  the  state.  Over 
60  percent  of  the  readers  and  table  leaders  at  the  second 
site  voluntarily  completed  the  questionnaire  and  responded 
to  specific  questions  regarding  the  substance,  format,  and 
clarity  of  the  instrument.  (See  Appendix  A  for  a  copy  of 
the  pilot  questions.)  A  stamped,  self-addressed  envelope 
was  provided  for  the  return  of  the  pilot  questionnaires. 
The  respondents  made  specific  suggestions  for  wording 
changes,  and  they  asked  for  additional  items  to  be  included 
in  parts  I  (features  of  writing)  and  II  (biases).  In 
addition,  several  requested  that  the  absolute  categories 
of "never" and "always" be provided as options in parts III
and V. Many of the respondents indicated that answering
the  questionnaire  had  been  an  interesting,  challenging, 
or  educational  experience  for  them. 

The  Unmonitored  Holistic  Scoring 

During  the  month  prior  to  the  monitored  holistic 
scoring,  each  of  the  eight  regular  readers  was  asked  to 
score  holistically  at  home  four  packets — over  50  papers — 
of  matched  essays  written  by  different  students  on  the 
same topics. Readers were asked to jot down their impressions
of the papers in the running log, just as they were
subsequently  asked  to  do  during  the  monitored  scoring 
session.  Readers  were  sent  instructions  on  how  to  use  the 
log,  as  well  as  a  copy  of  the  operational  definitions 
currently  used  in  the  CLAST  administration  (see  Appendix 
A).  These  definitions,  which  describe  the  characteristics 
typical  of  a  certain  level  of  essay,  were  the  only  training 
materials  provided  to  the  readers  in  the  unmonitored 
setting. 

The  papers  were  scored  over  a  four-week  period. 
These  papers  with  their  scores  and  comments  reflected 
how  experienced  holistic  scorers  scored  without  being 
monitored  and  without  being  part  of  a  group  situation. 
During  the  unmonitored  scoring,  the  table  leaders  and 
the  chief  readers  were  assigned  different  tasks  from  the 
readers. For example, the chief readers met to review the
entire group of papers to be used in the study; they
discussed  each  score  until  they  agreed  upon  an  appropriate 
rating  for  each  essay.  This  practice,  while  certainly  not 
typical  of  an  actual  holistic  scoring,  was  included  to 
ascertain  the  chief  readers'  scores  for  the  entire  set, 
thereby  helping  to  answer  the  question  posed  as  to  the  role 
the  chief  readers  play  in  influencing  scores  that  are 
given.  In  addition,  the  table  leaders  read  all  the  sample 
papers  and  rangefinders  to  be  used  in  the  subsequent 
monitored  scoring,  rating  each  paper  and  writing  their 
responses  to  each  essay.  This  procedure  represented  a 
departure  from  typical  procedures.  That  is,  under  normal 
circumstances,  the  table  leaders  meet  with  the  chief 
readers  prior  to  a  scoring  to  read  and  score  the  sample 
papers  the  chief  readers  have  selected;  then  they  discuss 
the  results  together. 

Methods  Used  for  Analyzing  Data 

The  questions  posed  in  Chapter  1  of  the  study  are  again 
listed  below  together  with  the  methods  used  for  analyzing 
the  data;  special  attention  has  been  paid  to  how  well  the 
monitoring  of  an  holistic  scoring  reflects  the  "true 
community  of  assent"  as  noted  by  White  (1985). 


1.  Do  the  mean  scores  for  the  essays  differ  when  the 
papers  are  evaluated  by  readers  working  in  a 
monitored  setting  from  when  they  are  judged  by  the 
readers  working  independently? 




To  answer  question  1,  an  analysis  of  variance,  equal 
cell  size  mixed  model,  was  used  on  a  Biomedical  program. 
The  design  was  randomized,  with  the  eight  regular  readers 
comprising  the  repeated  measure.  (The  four  special  readers 
who  completed  the  protocols  were  not  included  in  any 
statistical  analysis  as  they  had  scored  far  fewer  essays 
than  had  the  eight  regular  readers.)  The  model — P,  T, 
E(P),  R(T) — included  the  following  four  random  factors:  P, 
signifying  the  number  of  essay  pairs  (51);  T,  representing 
the  number  of  tables  of  readers  (2);  E(P),  signifying  the 
monitored  versus  unmonitored  essays  nested  within  each 
pair  (2);  and  R(T),  representing  the  number  of  readers 
nested  within  each  table  (4).  Three  Quasi  F  ratios  were 
calculated  according  to  the  formula  of  B.  J.  Winer  (1971) 
in  Statistical  Principles  in  Experimental  Design  for  pairs 
(Fp),  tables  (Ft),  and  pairs  within  tables  (Fp(t)). 
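
The general form of a quasi F ratio can be sketched as follows. When no
single mean square serves as an exact error term, as can happen in a
design with several random factors, a ratio of sums of mean squares is
formed, and its degrees of freedom are approximated by Satterthwaite's
formula. Which mean squares enter each ratio depends on the expected
mean squares of the design, which are not reproduced here. The values
in the example are invented for illustration, and the function name is
hypothetical.

```python
def quasi_f(numerator, denominator):
    """Quasi F ratio from lists of (mean_square, degrees_of_freedom) pairs.

    Returns the ratio together with Satterthwaite-approximated degrees
    of freedom for the numerator and the denominator.
    """
    def combine(terms):
        total = sum(ms for ms, _ in terms)
        # Satterthwaite approximation: (sum MS)^2 / sum(MS^2 / df)
        approx_df = total ** 2 / sum(ms ** 2 / d for ms, d in terms)
        return total, approx_df

    num_ms, num_df = combine(numerator)
    den_ms, den_df = combine(denominator)
    return num_ms / den_ms, num_df, den_df


# Invented mean squares, for illustration only.
f_ratio, df1, df2 = quasi_f([(10.0, 2), (2.0, 4)], [(3.0, 2), (3.0, 4)])
```

The approximated degrees of freedom are generally not whole numbers, so
critical values are usually found by interpolation or by software.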

2.  Do  experienced  readers  participating  in  a  monitored 
scoring  achieve  greater  agreement  with  each  other 
than  when  they  evaluate  essays  independently? 

To answer question 2, Cronbach's alpha was used to
indicate the degree of interrater reliability under the two
different scoring conditions. The interrater reliability rate
was instrumental in showing both the extent to which
monitoring helped readers score alike and the extent to
which readers may have internalized the standards.
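
Cronbach's alpha can be computed by treating each reader as an "item"
and each essay as a case. The sketch below shows the standard formula
only; the scores are invented, and the function name is hypothetical
rather than drawn from the software used in the study.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a list of essays, each a list of reader scores."""
    k = len(scores[0])                    # number of readers
    reader_columns = list(zip(*scores))   # one tuple of scores per reader
    reader_variances = sum(variance(col) for col in reader_columns)
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - reader_variances / total_variance)


# Invented scores on the 1-4 scale: four essays, three readers.
example_scores = [[2, 2, 3], [3, 3, 3], [4, 3, 4], [1, 2, 1]]
alpha = cronbach_alpha(example_scores)
```

Values near 1 indicate that the readers order the essays consistently;
readers who assign identical scores to every essay produce a value of 1.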


3.  What  impact  do  the  chief  readers  have  on  an  holistic 
scoring?  How  do  they  ensure  both  a  reliable  and  a 
collegial  reading? 




For  question  3,  the  comments  which  the  chief  readers 
made  during  the  training  sessions  were  examined,  and  the 
check  reading  results  were  reviewed.  In  addition,  because 
the  chief  readers  had  scored  all  the  essays  beforehand  as 
part  of  their  task  in  the  unmonitored  setting — a  task  not 
traditionally  associated  with  their  role — a  second 
Cronbach's  alpha  was  used  to  determine  how  well  their 
scores correlated with those of the readers. The chief
readers'  comments,  together  with  these  scoring  results, 
helped  to  indicate  to  what  extent  the  chief  readers  were 
able to guide readers into assenting or "owning," as White
(1985)  indicates,  the  standards  of  the  group. 

4.  What  criteria  do  readers  use  in  assigning  different 
score  levels?  What  standards  are  reflected  in  the 
score  levels  assigned  across  the  essays?  How  do 
readers  respond  to  these  standards? 

For the fourth category of questions, information from
the eight regular readers' logs was transferred to a
Database 3 program; the readers' written comments were
grouped in categories similar to those on Part I of the
questionnaire (e.g., rhetoric, mechanics, grammar and
usage). The database program (see Appendix A for a sample
entry) not only indicated through pluses and minuses
whether the readers' comments were positive or negative
but also allowed for paraphrases of each comment to be
included. The database program was used to tally the
positive and negative responses the readers made in their
logs at each score level; the program was also used to
determine the exact nature of the responses (e.g., whether
rhetorical, mechanical, or grammatical) which readers gave
to papers at varying score levels. The audiotaped
protocols provided further corroboration of these criteria.

An  English  teacher  with  extensive  training  and 
experience  in  teaching  writing  served  as  an  outside  expert 
to  validate  independently  the  accuracy  of  the  database 
logs.  She  randomly  reviewed  20%  of  the  logs  from  each  of 
the two scoring conditions and compared the readers'
comments  against  each  database  entry.  Whenever  she  found 
any  errors,  the  database  entries  were  adjusted  accordingly 
before  any  analysis  was  done. 

5. Do any common patterns appear in the scorers'
written  responses  to  the  essays,  or  do  their 
comments  underscore  the  individuality  of  each 
reader's  transaction  with  the  text?  Do  readers' 
holistic  judgments,  as  shown  by  their  written  or 
oral  responses,  correspond  to  the  writing  features 
they  rate  as  important  on  the  questionnaire? 

For the fifth category of questions, the readers'
comments, both written and oral, were studied for any
common patterns that might emerge in either the monitored
or unmonitored condition. The comments were examined to
see whether readers giving an identical score to the same
essay cited similar or different reasons for doing so, such
as organization, fluency of sentence style, or creativity.
It was hoped that identifying patterns of this nature would
help to answer whether a sense of community develops to
influence readers' perceptions.

6. What is the nature of the monitoring that the
readers  receive  during  a  scoring  as  reported  through 
the  logs  of  table  leaders  and  readers?  Do  the 
procedures  noted  in  these  logs,  together  with  the 
protocols  of  the  special  readers,  support  the 
readers'  perceptions  of  their  own  holistic  scoring 
processes  as  noted  on  the  questionnaire? 

For question 6, the logs of the three table leaders,
together with the "procedure section" of the readers' logs
(see Appendix A for a sample of the log) and the
audiotapes, were examined for clues to the nature of
monitoring. It was hoped that the logs would reveal how
directive the table leaders were and what type of
relationship existed between the table leaders and the
readers. Of special concern were how the readers responded
to different criteria and whether the data supported the
readers' perceptions of the holistic scoring process as
reflected through their responses to the questionnaire.

Thus,  both  qualitative  and  quantitative  data  were  used 
to  determine  whether  monitoring  in  an  holistic  scoring 
reflects  a  congenial  effort  among  scorers  to  arrive  at 
common  agreement  throughout  a  scoring  and  whether  this 
sense  of  community  affects  the  judgments  that  scorers  make. 

An in-depth discussion of the results obtained in the
study is presented in Chapter 4.


CHAPTER  4 
RESULTS  AND  DISCUSSION 


The  results  of  the  study  are  presented  according  to  the 
questions  raised  in  the  first  chapter.  The  first  three 
sections  deal  primarily  with  the  quantitative  results  and 
the  last  three,  with  the  qualitative  findings. 

Mean  Scores  in  the  Two  Scoring  Conditions 

Question  1:  Do  the  mean  scores  for  the  essays  differ 
when  the  papers  are  evaluated  by  readers  working  in  a 
structured  setting  from  when  they  are  judged  by  the  readers 
working  independently? 

When the mixed-model analysis of variance for nested
factors and repeated measures was computed, three
statistically significant main effects were found, but no
significant interactions. Not surprisingly, as shown in Table 2,
statistically  significant  differences  were  found  (p  <  .05) 
among  the  pairs  of  essays.  That  is,  each  pair  of  matched 
essays  differed  from  the  next  pair  of  matched  essays.  Also 
not  surprisingly,  readers  nested  within  tables  differed  to 
a  statistically  significant  extent  (p  <  .001).  As  will  be 
seen  in  the  discussion  for  question  5,  the  qualitative  data 
highlighted  the  individuality  of  the  readers,  thereby 
confirming  these  differences  among  readers. 


TABLE 2
Analysis of Variance Mixed Models Source Table

                                Error     Sum of       Degrees of   Mean
Source                          Term      Squares      Freedom      Square   F

Mean                                      4470.71078     1
Pairs [P]                                  252.91422    50          5.0583   3.71* (a)
Tables [T]                                   7.84314     1          7.8431   9.26 (a)
Essays within Pairs [E(P)]      ET(P)       72.37500    51          1.4191   5.03***
Readers within Tables [R(T)]    PR(T)        5.01471     6          0.8358   3.90**
Pairs crossed with Tables [PT]              11.28186    50          0.2256   0.93 (a)
Tables crossed with Essays
  within Pairs [ET(P)]          ER(PT)      14.37500    51          0.2819   1.12
Pairs crossed with Readers
  within Tables [PR(T)]         ER(PT)      64.23529   300          0.2141   0.85
Error [ER(PT)]

Note: (a) Signifies Quasi F ratios.
*   p < .05.
**  p < .001.
*** p < .00005.



What  seemed  especially  meaningful  for  the  purposes  of 
this  study  was  the  statistically  significant  difference 
(p < .00005) found for essays nested within pairs [E(P)];
these  essays  represented  the  two  conditions  of  monitoring 
and  non-monitoring.  The  overall  mean  for  the  51  monitored 
essays  was  2.279,  with  a  standard  deviation  of  .559, 
whereas  the  overall  mean  for  the  matched  set  of  51 
unmonitored  essays  was  2.401,  with  a  standard  deviation 
of  .690.  Thus,  not  only  was  the  mean  score  for  the 
unmonitored  essays  significantly  higher  than  the  mean 
for  the  monitored  essays,  but  more  of  a  spread  existed 
among  the  mean  scores  on  each  essay  in  the  unmonitored 
condition. 
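
The ordinary F ratios in Table 2 can be checked directly from the reported mean squares, since each is the mean square of an effect divided by the mean square of its error term. The sketch below is illustrative only; the dictionary and function names are hypothetical, and the three Quasi F ratios are omitted because they follow Winer's (1971) formula rather than a simple quotient.

```python
# Mean squares as reported in Table 2.
mean_squares = {
    "E(P)": 1.4191,   # essays within pairs (monitored vs. unmonitored)
    "ET(P)": 0.2819,  # error term for E(P)
    "R(T)": 0.8358,   # readers within tables
    "PR(T)": 0.2141,  # error term for R(T)
}


def f_ratio(effect, error_term):
    """F = mean square of the effect / mean square of its error term."""
    return mean_squares[effect] / mean_squares[error_term]


f_essays = f_ratio("E(P)", "ET(P)")
f_readers = f_ratio("R(T)", "PR(T)")
print(f"{f_essays:.2f} {f_readers:.2f}")  # prints 5.03 3.90
```

Both quotients agree with the F values listed in the table for essays within pairs and for readers within tables.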

The overall higher mean for the unmonitored essays
seemed due to the substantial number of upper-half scores
(3's and 4's) awarded papers in the unmonitored condition:
Whereas only four essays out of the monitored set of 51 had
a mean of 3 or better, 14 essays out of the unmonitored set
of 51 had a mean of 3 or better. Figures 1 and 2 depict
the breakdown of scores by reader and condition.

[Figures 1 and 2. Breakdown of scores by reader and condition.]

As can be seen, readers across the board gave fewer
scores of 4 in the monitored condition than in the
unmonitored; for several readers, the difference was
dramatic. Admittedly, the absence of any scores of 4 for
some readers in Figure 2 is misleading in that virtually all
readers gave at least one score of 4 during the monitored
scoring; some of those were included in the data deleted as
deviant  when  the  independent  scoring  of  the  chief  readers 
showed  four  sets  to  be  clearly  mismatched.  However,  even 
if  these  sets  had  been  included,  the  monitored  papers  would 
still  have  had  only  half  as  many  3's  and  4's  as  the 
unmonitored  set.  This  finding  does  not  mean  that  the 
monitored  scorers  were  awarding  only  lower-half  scores,  for 
several  readers  gave  3+  scores  in  their  logs  to  show  they 
perceived  some  essays  to  be  especially  strong.  However, 
such  pluses  and  minuses  could  not  be  included  in  the  data 
analysis  because  the  actual  scoring  of  an  essay  allows  for 
only  a  numerical  score.  Thus,  the  fact  remains  that  during 
the  monitored  scoring,  the  spread  of  scores  was  tightened 
and  the  mean  lowered.  Why  this  tendency  should  have 
occurred is intriguing. One possible explanation lies with
the studies of Breland and Jones (1984) and of
Sweedler-Brown (1985), who found that experienced scorers tended
to score more strictly than did less experienced scorers.
Perhaps this study reflected a similar trend, with the
training procedures and the monitoring by table leaders
lowering some individual readers' scores as readers adhered
strictly to the criteria under the monitored condition.

In  fact,  readers  indicated  on  their  questionnaires 
(see  the  open-ended  question  after  item  24  in  Appendix  A) 
that  they  tended  to  grade  timed  writings  more  leniently 
than they did papers written outside class. Without
rangefinders or sample papers to measure the actual essays
against during the unmonitored scoring, some readers, such
as Readers 1B and 1C, may have awarded papers higher scores
than  they  did  the  matched  essays  during  the  monitored 
scoring  when  group  standards  became  a  constant  focus  of 
attention.  One  reader  even  wrote  in  her  unmonitored  logs 
of  several  instances  in  which  she  would  have  consulted 
a  table  leader  if  she  could  have,  and  another  reader 
expressed  regret  at  not  having  rangefinders  to  examine. 
Still  others  noted  during  their  unmonitored  scorings  that 
they  consulted  their  operational  definitions.  Thus,  during 
the  unmonitored  scoring,  some  readers  felt  the  need  for 
standards  to  anchor  their  evaluations  against. 

Part  of  the  explanation  for  the  lower  mean  score  of  the 
monitored  scoring  may  lie  with  the  nature  of  the  monitoring 
itself.  That  is,  in  the  course  of  either  scoring  training 
samples  together  or  individually  discussing  specific  papers 
with  table  leaders,  readers  may  have  become  more  attuned  to 
problems than when they were reading the essays
impressionistically on their own. Indeed, most readers commented on
their questionnaires (item 50) that they tended to view
problematic  papers  both  holistically  and  analytically.  In 
this  sense,  even  the  actual  scoring  process  for  this  study 
may  have  contributed  to  a  more  analytic  scoring  than  usual 
in  that  readers  were  asked  to  note  in  their  logs  the 
elements  to  which  they  were  responding. 



The  logs  of  the  table  leaders  also  suggest  that 
the  training  process  may  have  contributed  to  the 
significant  difference  in  mean  scores  for  the  matched 
sets.  The  logs  showed  that  on  those  four  occasions  in 
which  two  readers  changed  their  scores  after  talking 
to  their  table  leaders,  the  readers'  scores  were  lowered, 
rather  than  raised.  Similarly,  even  though  one  check- 
reading  paper  was  returned  to  a  reader  because  it  had 
been  scored  too  low,  three  others  were  returned  to  two 
readers  and  to  one  table  leader  because  they  had  been 
scored  too  high.  It  is  conceivable  that  those  readers 
who  lowered  their  scores  after  they  reviewed  the  essays 
under  debate  may  have  had  their  subsequent  scores 
influenced — at  least  for  a  short  period  afterward — by 
this  experience;  for  example,  many  readers  indicated  on 
their  questionnaires  (item  42)  that  the  return  of  a  paper 
"sometimes"  affected  their  subsequent  scoring  processes. 
Admittedly,  only  a  few  readers  were  involved  with  returns; 
hence,  such  an  explanation  has  limited  application. 
Nevertheless, both the qualitative data and the
quantitative data (which, as indicated by Figure 2, show
three readers' scores moving downward and five readers'
scores clustering in the middle) illustrate a stricter
adherence to criteria in the monitored condition than
in the unmonitored.



Interrater  Reliability 

Question  2:  Do  experienced  readers  participating  in 
a  structured  scoring  achieve  greater  agreement  with  each 
other  than  when  they  evaluate  essays  independently? 

Cronbach's  alpha  was  used  to  determine  the  extent  of 
agreement  among  the  eight  readers  in  both  scoring 
conditions. In the unmonitored condition the alpha was
.936 for the 51 essays; in the monitored condition the
alpha was .915 for the matched set of 51 essays. Thus,
in both conditions the interrater reliability was high,
and the readers scoring the essays independently appeared
to agree with each other as closely as, if not slightly
more closely than, they did when they scored essays
as a group.
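
Because each alpha pools the scores of eight readers, it may help to see what the same level of agreement would imply for fewer readers. The following is a rough sketch only, assuming that the Spearman-Brown relation applies to these readers; that assumption, the function names, and the resulting estimates are illustrative and are not part of the study's analysis.

```python
def single_reader_reliability(alpha, k):
    """Invert the Spearman-Brown relation: the reliability of one
    reader implied by an alpha computed over k readers (assumed
    interchangeable)."""
    return alpha / (k - (k - 1) * alpha)


def stepped_up_reliability(r_single, n):
    """Spearman-Brown estimate of reliability when n readers' scores
    are combined."""
    return n * r_single / (1 + (n - 1) * r_single)


# Alphas reported above for the eight readers in each condition.
for label, alpha in (("unmonitored", 0.936), ("monitored", 0.915)):
    r1 = single_reader_reliability(alpha, 8)
    r2 = stepped_up_reliability(r1, 2)
    print(f"{label}: one reader {r1:.2f}, two readers {r2:.2f}")
```

Under this assumption, the estimate for two readers is necessarily lower than the eight-reader alpha, which is one reason a coefficient based on many readers should not be compared directly with one based on a pair.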

As can be seen from Table 3, no one reader appeared to
affect this high interrater reliability coefficient
substantially: That is, if individual readers had been
removed from the analysis, the lowest alpha in the
unmonitored scoring would still have been .92; similarly,
the lowest alpha in the monitored scoring would still have
been .894 if individual readers had been removed. In the
unmonitored scoring, Readers 1C and 2C had the lowest
correlations with the other readers (.74 and .73,
respectively), whereas in the monitored scoring, Readers 1A
and 1C had the lowest correlations with the other readers
(.67 and .56, respectively). These correlations
substantially exceed the .31 correlation among all the
untrained readers in the study of Diederich et al. (1961);
they also exceed the .41 correlation among the English
teachers in that same study.

TABLE 3
Cronbach's Alpha of Readers' Scores

          Scale Mean   Scale Variance   Corrected      Alpha
          If Item      If Item          Item-Total     If Item
Reader    Deleted      Deleted          Correlation    Deleted

Unmonitored Scoring

1A        16.902       25.050           .789           .927
1B        16.549       23.093           .759           .929
1C        16.667       23.627           .741           .930
1D        16.726       23.443           .811           .924
2A        16.882       23.346           .872           .920
2B        16.824       24.468           .789           .926
2C        16.863       25.561           .733           .931
2D        17.098       23.850           .750           .929

Monitored Scoring

1A        15.863       16.521           .671           .908
1B        15.745       14.954           .725           .904
1C        15.824       16.628           .563           .916
1D        16.020       15.060           .736           .903
2A        16.020       15.660           .801           .898
2B        16.196       15.961           .750           .902
2C        15.941       15.817           .725           .904
2D        16.039       14.278           .831           .894

That  the  readers  of  this  study  seemed  to  agree  so 
strongly  among  themselves  in  the  unmonitored  scoring 
condition  suggests  that  they  had,  from  their  years  of 
scoring  together,  undoubtedly  internalized  the  standards. 
Still  another  contributing  factor  may  be  the  provision  of 
operational definitions for the readers' use during the
unmonitored  scoring;  in  this  respect,  the  independent 
scoring  condition  differed  substantially  from  the  at-home 
scoring  in  the  study  by  Diederich  et  al.  (1961),  in  which 
readers  were  given  few  directions  and  no  criteria  on  which 
to  base  their  judgments. 

Thus,  even  though  the  readers  of  this  study  had  no 
table  leaders  to  whom  to  turn  for  guidance,  and  even  though 
they  had  no  rangefinders  or  sample  papers  written  on 
the  applicable  topics,  the  readers  could  consult  the 
definitions  for  each  score  level;  in  fact,  the  logs  and 
tapes  indicated  that  several  readers  did  indeed  do  so. 

At the same time, these results must be interpreted
with caution. Because Cronbach's alpha was used with a
substantial number of readers (namely, eight), the
reliability coefficient is undoubtedly higher than it might
have been if the scores of only two readers had been
correlated, as happens in a typical scoring. Furthermore,
because the alpha was simultaneously comparing all the
readers' scores for 51 essays, it masked the possibility of
discrepant scores occurring on individual essays. For
example, as shown by Figure 3, the potential for split
scores was about twice as high in the unmonitored scoring
as in the monitored scoring: That is, if two readers in the
unmonitored scoring had been paired against each other on a
given essay, as happens in an actual scoring, then on 33.3%
of the 51 essays, discrepant or noncontiguous scores might
have arisen. In contrast, if two readers had been paired
against each other in the monitored scoring, discrepant or
noncontiguous scores conceivably could have occurred on
15.6% of the papers. To put the findings another way, 8 of
the 51 papers in the monitored condition received 3 of 4
possible scores; the remaining 43 papers received scores
that were either identical or contiguous. On the other
hand, in the unmonitored condition 17 of the 51 papers
received 3 of 4 possible scores; the remaining 34 papers
received scores that were either identical or contiguous.
Thus, the monitored scoring clearly reduced the potential
for having split scores arise.
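
The count of essays with split-score potential can be sketched as follows. This is an illustration only: it treats an essay as having split potential when the highest and lowest scores it received differ by two or more points, which is an assumed reading of "3 of 4 possible scores"; the function names and the small score table are hypothetical.

```python
def has_split_potential(essay_scores):
    """True if some pair of readers gave this essay noncontiguous
    scores, i.e., the highest and lowest scores differ by 2 or more."""
    return max(essay_scores) - min(essay_scores) >= 2


def split_potential_rate(score_table):
    """Proportion of essays on which a split score could arise.

    score_table: list of rows, one row per essay, one score per reader.
    """
    flagged = sum(has_split_potential(row) for row in score_table)
    return flagged / len(score_table)


# Hypothetical scores: four essays, each read by four readers.
demo_table = [
    [2, 2, 3, 2],  # contiguous
    [1, 2, 3, 2],  # 1 and 3 are noncontiguous
    [4, 4, 3, 3],  # contiguous
    [2, 4, 3, 3],  # 2 and 4 are noncontiguous
]
demo_rate = split_potential_rate(demo_table)
```

Applied to the counts reported above, 17 flagged essays out of 51 correspond to the 33.3% figure for the unmonitored scoring.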

Impact  of  Chief  Readers  on  Scoring 


Question  3:  What  impact  do  the  chief  readers  have  on 
an  holistic  scoring?  How  do  they  ensure  both  a  reliable 
and  a  collegial  reading? 


[Figure 3. Potential for split scores in the unmonitored and
monitored scorings.]


To determine the effect of the chief readers' scores
on the reliability of the reading, a second Cronbach's
alpha was run. The chief readers in this study had, prior
to the monitored scoring, independently rated all the
essays; 81% of the time their scores were identical to
each other's, and the remaining 19% of the time their
scores were contiguous, as when one chief reader decided a
paper was a weak 3 and the other chief reader described it
as an upper 2.

When the chief readers' scores were included in the
monitored condition, the alpha was .9358. When the chief
readers' scores were included in the unmonitored condition,
the alpha was .9474. As might be expected, therefore, the
inclusion of the chief readers' scores did not
substantially affect the first Cronbach's alpha. However, the
second alpha did reveal the extent to which the eight
individual readers' scores corresponded to the ratings of
the chief readers, whose scores are sometimes viewed as
the "true" scores. This interpretation of chief readers'
scores as "true" does not mean that chief readers are
infallible in their scoring; however, their experience with
and commitment to the standards, their involvement in all
phases of an holistic scoring from sample selection to the
refereeing of discrepant essays, and their responsibility
for ensuring that each scoring runs effectively lend
particular credence to most chief readers' ratings.



As can be seen from Table 4, the correlations with the
average of the chief readers' scores appear lower in the
monitored condition than in the unmonitored. However, the
overall correlation may mask what occurred on individual
essays. For example, the number of essays on which readers
gave identical scores to those of the chief readers was
higher for each reader in the monitored condition than in
the unmonitored condition. Moreover, the number of readers
who disagreed with the chief readers' scores, as well as
the number of actual essays on which readers' scores
differed the most from those of the chief readers (by more
than one point), was smaller in the monitored condition
than in the unmonitored condition, when readers were
virtually on their own. These results suggest that in the
monitored setting, the readers were more apt to score the
same way as the chief readers than they were when scoring
at home. This finding is not surprising in
that during the monitored scoring, the chief readers were
able to make their judgments known through the samples, the
check readings, and their frequent interactions with the
table leaders.
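
The essay-by-essay comparison summarized in Table 4 can be sketched as a simple classification of each reader's score against the average of the two chief readers' scores. The sketch below is illustrative only; the function name and the example scores are hypothetical, and the category labels merely paraphrase the table's column headings.

```python
from collections import Counter


def compare_with_chief_readers(reader_score, chief_scores):
    """Classify a reader's score relative to the averaged chief
    readers' scores; averaging two whole-number scores is what makes
    half-point differences possible."""
    chief_avg = sum(chief_scores) / len(chief_scores)
    diff = reader_score - chief_avg
    if diff == 0:
        return "identical"
    if abs(diff) > 1:
        return "differs by more than 1 point"
    return "1/2 to 1 pt. higher" if diff > 0 else "1/2 to 1 pt. lower"


# Hypothetical (reader score, chief readers' scores) for four essays.
demo_pairs = [(3, (3, 3)), (2, (3, 2)), (4, (2, 2)), (3, (2, 2))]
demo_tally = Counter(compare_with_chief_readers(r, c) for r, c in demo_pairs)
```

Tallying the categories for each reader in each condition would yield counts of the kind reported in the table.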

TABLE 4
Cronbach's Alpha of Readers' and Chief Readers' Scores

         Correlation   # of Essays     # of Essays    # of Essays    # of Essays
         with Avg.     with Scores     with Scores    with Scores    Differing
         of C.R.s'     Identical to    1/2 to 1 pt.   1/2 to 1 pt.   by More Than
Reader   Scores        C.R.s' Scores   LOWER          HIGHER         1 Point

Monitored Condition

1A       .63**         31              10             10
1B       .54**         24              10             15
1C       .41*          29               7             11
1D       .75**         34              13              4
2A       .75**         35              11              5
2B       .62**         27              21              3
2C       .62**         29              12             10
2D       .67**         28              15              7

Unmonitored Condition

1A       .74**         26              16              9
1B       .67**         20               9             18
1C       .65**         20              12             16
1D       .65**         21              14             14
2A       .73**         23              18             10
2B       .61**         21              17             12
2C       .67**         25              14             11
2D       .68**         16              25              9

Note: Readers' scores could differ from chief readers' scores by half a point
in this computer program because the chief readers' scores were averaged.
*  p < .01.
** p < .001.

Qualitative Findings About the Chief Readers' Influence

The qualitative data, together with the researcher's
observation of the monitored scoring, illustrated some of
the interaction between the chief reader, the associate
chief reader, and the table leaders and readers in the
study.  Although  the  researcher's  viewpoint  is  not 
entirely  objective  in  that  she  has  been  involved  with 
holistic  scorings  for  a  number  of  years,  she  simply 
recorded  as  objectively  and  as  comprehensively  as  possible 
what  took  place  externally  during  the  monitored  holistic 
scoring. The scoring session began with the six
rangefinders, which comprise papers selected as representative
of the four scoring levels. All 12 readers were involved
with the training; the special readers of Table 3 who were
taping essay subsets then adjourned to private offices to
record their reactions to essays, returning after breaks
to participate again in each training session. Each reader
was asked to rank order these six essays in terms of
quality, assigning each score level (1 through 4) to at
least one essay. The scores were then
publicly tallied, and a brief discussion ensued in which
the table leaders talked with readers about the essays.
The  central  role  such  papers  play  in  a  reading  was 
inadvertently  conveyed  in  the  taped  comment  of  Reader  3A, 
who,  during  a  period  of  hesitancy  in  her  unmonitored 
scoring  of  essays  at  home,  stated: 

I realize at this time that I miss the rangefinders.
Starting out just kind of cold with
these  first  five  papers,  I  seem  to  have  a  tendency 
to  develop  a  range  among  these  papers,  and,  uh,  of 
course  that  really  can't  be  done.  ...  I  can  see 
now  what  the  purpose  of  the  rangefinders  is — to 
get  an  idea  in  my  mind  as  to  what  I'm  looking  for 
in  the  different  scores. 



Thus,  while  operational  definitions  are  important  in 
describing  typical  characteristics  of  essays  at  each  score 
level,  the  rangefinders  stand  as  actual  essays  which 
exemplify  for  current  topics  the  scoring  criteria. 

As shown in Figure 4, in which circled numbers represent
the number of readers assigning the accurate score,
agreement among the 12 readers on the rangefinders was high. In
addition  to  presenting  the  rangefinders,  the  chief  reader 
introduced  11  pairs  of  samples  to  readers  at  set  intervals. 
Again,  as  the  partial  set  indicates  in  Figure  4,  agreement 
was  consistently  high. 

The chief reader made general comments about the
samples and rangefinders. For example, he noted that paper
FF was not a "great paper" although he called attention to
one positive feature about its structure. He agreed with
the readers that sample O was certainly a 1, and he
observed that it made a good training paper in that
more blasé readers would choose not to struggle with it.
He called attention to the deteriorating quality in
sample N by noting that the first page was upper-half, the
second page a 2, and the third page almost a 1; he agreed
that U was indeed a 3/2 paper, as the presence of only one
paragraph made it troublesome to some readers. Never
singling out readers who were off target, he suggested that
a score which was too high (as occurred in one instance
with rangefinder LL) was "charitable" or a score which was
too low was rather harsh. Through these means he fulfilled
what White (1985) calls the "heavy responsibility" of
leaders "to ensure a reliable essay reading while at the
same time respecting the professionalism, good will, and
individuality of the readers who are grading the papers"
(p. 31).

[Figure 4. Results of rangefinders and samples in the
monitored scoring. Rangefinders: D, I, M, T, W, LL. Samples:
FF, O, JJ, CC, E, N, U, BB, V. Each paper was scored by all
12 readers.]

Except for the rangefinders, for which the chief reader
allowed a few minutes of conference time between the
readers and table leaders, group discussions of the samples
rarely occurred. The purpose of the training samples was
clearly to have readers ascertain where they stood in
relation to the other readers in the scoring and tallying
of the same papers.

During the course of the scoring, the chief readers
conducted two check readings. Each check reading consisted
of a random set of eight essays which were independently
scored by a reader, a table leader, and a chief reader,
none of whom knew the other scores. To have identical
scores among all three was the goal; however, contiguous
scores were considered acceptable if pluses and minuses
on the record sheet suggested that the readers' scores
approximated the chief reader's score. Thus, a score of
3 by a reader would be acceptable if the chief reader's
score was a high 2; conversely, if the chief reader or
associate chief reader had perceived an essay as a good
3, and the check reading showed a reader giving it a 2,
then the paper would be returned for a suggested rereading.
During the monitored scoring only one noncontiguous score
arose; however, the chief readers returned four papers to
the tables, asking either a reader or a table leader simply
to review the essay and reconsider the score. Thus,
through the reading of common samples and through check
readings, the chief readers ensured that everyone would
remain aware of group standards.

Writing  Criteria  Across  Different  Scoring  Levels 

Question  4:  What  criteria  do  readers  use  in  assigning 
different  score  levels?  What  standards  are  reflected  in  the 
score  levels  assigned  across  the  essays?  How  do  readers 
respond  to  these  standards? 

In answer to question 4, the questionnaires, logs, and
tapes were examined for insights into readers' attitudes
toward holistic scoring in general and toward the standards
used in the CLAST program in particular. As Table 4
indicates, 14 of the 17 study participants acknowledged on
the questionnaire that they "always" or "almost always"
endorsed the evaluation of written products. Thirteen
agreed that they "always" or "almost always" endorsed the
concept of scoring papers as a whole. Eleven stated that
they used timed writings at least occasionally in their
classrooms; nine admitted that at least sometimes they
used holistic scoring to evaluate classroom papers.
Conceptually, then, readers of this study supported the
value of holistic scoring.



Thirteen  participants  also  strongly  agreed  that  they 
felt  comfortable  with  the  standards  adopted  for  CLAST, 
although  two  others  stated  that  they  believed  the  standards 
were  too  low.  No  matter  what  their  attitude  toward  the 
standards,  all  the  participants  with  one  exception  said 
they  rarely  had  difficulty  adhering  to  group  standards. 
Readers  concurred  far  less  readily  about  whether  they  had 
expectations  of  what  a  CLAST  paper  should  look  like: 
Whereas  eight  agreed  that  they  almost  always  had  such 
expectations,  six  noted  that  they  occasionally  did,  and 
three  others  wrote  that  they  seldom,  if  ever,  did.  In 
fact, Table Leader 1 wrote in response to this question
that she "assess[ed] the writing on the basis of the work
present." The variety of readers' responses did not
support Roberts' assertion (1983) that readers envisioned
an  idealized  text  to  which  they  compared  student  essays. 

While  the  questionnaire  responses  indicated  that 
readers  basically  supported  both  the  concept  of  holistic 
scoring  and  the  actual  standards  used,  the  logs  and  tapes 
showed  that  applying  group  standards  to  actual  papers  was 
not  easy.  Some  essays  presented  special  difficulties  for 
readers. Not only did the tapes reveal several readers'
struggles to resolve whether such papers should be scored
up or down, but the logs also reflected a similar process
of adjustment through some readers' use of pluses, minuses,
and arrows. In an article on criteria for determining
writing  proficiency,  Shaughnessy  (1980)  called  attention  to 
"the  almost  infinite  number  of  possible  combinations  of 
strengths  and  weaknesses"  (p. 118)  which  readers  must 
balance  in  an  attempt  to  decide  whether  a  paper  is 
incompetent.  This  study  showed  a  similar  balancing  process 
occurring  at  all  score  levels.  For  example,  Reader  3B's 
audiotaped  comment  about  paper  024  revealed  a  typical 
struggle:  "I  don't  know.  I  wish  there  weren't  so  many 
errors  and  yet  it  has  so  much  imagery.  It  is  well  stated, 
and  it  is  informative  and  thought  provoking.  I  think  I'll 
go ahead and give it a 4." Reader 3C experienced a similar
difficulty with essay 052: "It's just a tough choice
between a 3 and a 2: the 2 because of the grammar
problems . . . and a 3 because this person uses . . . is very specific
with  a  lot  of  detail,  has  a  nice  flair  for  writing,  a  nice 
style.  The  paper  is  just  appealing.   It's  a  real  toss-up." 

The  same  term  "toss-up"  appeared  in  Reader  3A's  tapes, 
as  he,  too,  remarked  about  the  "tough  line"  involved  in 
distinguishing between 3/2 papers. As will be discussed
under Question 5, Reader 3A, like Reader 3D, perceived the
debate in terms of rewarding a paper for its strengths or
punishing an essay for its errors. For some essays, they
both  speculated  as  to  what  score  the  second  reader  might  be 
likely  to  give. 

The  tapes  revealed  that  Readers  3B  and  3C  mentally 
rank-ordered such troublesome essays, comparing them to
previous papers or to the operational definitions. Once
during the unmonitored scoring, Reader 3C went back to a
previous paper and raised its score; confessing that she
knew she was not supposed to alter her original evaluation,
she stated, nevertheless, "The contrast in these papers is
so great, and I feel so strongly about this being a 2, that
I  just  can't  possibly  see  giving  the  other  one  a  2  when  it 
was  so  exact  in  detail."  Clearly,  then,  those  papers  which 
did  not  fit  the  definitions  of  specific  score  levels  or 
which  contained  discrepancies  between  form  and  content  gave 
even  these  highly  experienced  scorers  difficulty. 

Standards  at  Score  Levels 

Although  some  essays  presented  special  difficulties  in 
scoring,  several  patterns  were  clearly  discernible  in 
papers  at  different  score  levels.  As  might  be  expected 
from their scarcity, essays given scores of 4 were viewed
as strong papers. Favorable comments in the logs or on
tapes centered on the quality of ideas, the solid
development of 4-level essays, the good organization, and
the  coherence  that  typified  the  best  papers.  In  their 
emphasis  on  such  qualities,  the  readers  of  this  study 
resembled  those  in  Diederich,  French,  and  Carlton's  study 
(1961),  in  Freedman's  (1979),  and  in  Breland  and  Jones's 
(1984).  To  a  lesser  extent,  readers  noted  the  mature 
diction  or  the  sentence  variety  that  often  existed  in 
4-level essays. Only a few negative comments were made
about 4 papers, especially with regard to "mechanical
problems,"  the  umbrella  term  used  to  refer  to  a  variety 
of  grammatical  and  mechanical  errors;  however,  such 
problems  were  generally  deemed  minor. 

That  4's  were  rare  was  suggested  by  Reader  3B's 
references  to  the  "stellar"  qualities  she  expected  in 
top  papers;  Table  Leader  2  conveyed  a  similar  expectation 
in  her  observation  that  her  classroom  standards  were  higher 
than  those  for  CLAST  but  that  "A  CLAST  4  will  be  an  A  in 
my  class  any  day!"  (Notwithstanding  this  table  leader's 
comment,  it  is  important  to  note  that  the  four  points 
of  the  holistic  scoring  scale  used  in  this  study  are  not 
equivalent  to  the  letter  grades  of  A  through  D;  in  fact, 
scales  of  six  or  eight  points  are  often  used  in  holistic 
scorings  to  allow  for  finer  distinctions.) 

Reader  3C's  reasoning  about  paper  033,  to  which  she 
assigned  a  4,  revealed  the  high  standards  expected  for  such 
essays:  "The  vocabulary  is  very  good,  sentence  structure 
is  complex,  and  the  paper  seems  to  have  a  lot  of  depth  and 
carries  the  thought  all  the  way  through."  Reader  3D  gave 
the same paper a 4 because the detail was "sensible and
alive," fulfilling his expectations as a reader; to him,
the overall paper was "fluent, articulate, and organized."
Like Reader 3C, Reader 3A commented that he expected to see
in a 4 paper "something that shows me that this person's
mind is in the top quartile. For me it's distinctive
phrasing,  it's  inventive  details,  someone  showing  superior 
knowledge."  Emphasizing  that  such  papers  do  exist,  Reader 
3A argued that readers should not give solid 3 papers
a score of 4 simply because they have not seen an
exceptionally strong essay in a while; rather, he observed,
table leaders must keep readers aware of that distinction.
These readers' perceptions of 4 papers as truly outstanding
or  distinctive  in  some  way  help  to  explain  why  relatively 
few  essays  were  assigned  that  rating. 

As  Figure  5  indicates,  readers  did,  however,  readily 
assign scores of 3 to papers they deemed upper-half.
Comments recorded in the logs about the 3-level papers
were, as with the 4-level papers, largely positive about
content, development, organization, style, and approach.
However, in contrast to their comments on the top essays,
readers often noted some problems with the 3 papers. The
problems were varied, ranging from some rhetorical issues
of focus, organization, or style to, more commonly, the
mechanical elements
of  sentence  structure,  usage  errors,  sentence  errors, 
punctuation,  and  spelling.  Both  the  variety  and  the  number 
of  problems  noted  in  the  logs  clearly  differentiated  the 
3-level  papers  from  the  4-level  essays. 
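
The visual display in Figure 5 can be reproduced from the raw tallies of log comments. The following Python sketch is offered only as an illustration: the function name `display_row`, the rounding of percentages to the nearest whole number, and the rule of one asterisk per full 5 percent are assumptions inferred from the figure's legend, not a procedure described in the study.

```python
def display_row(count, entries):
    """Return (percent, bar) for one comment category at one score level.

    Assumptions (not taken from the study itself):
    - percent is count / entries, rounded to the nearest whole percent;
    - each asterisk stands for one full 5 percent, so remainders are dropped.
    """
    if entries <= 0:
        raise ValueError("entries must be positive")
    percent = int(100 * count / entries + 0.5)  # round half up
    bar = "*" * (percent // 5)                  # one asterisk per full 5%
    return percent, bar


if __name__ == "__main__":
    # Illustrative tallies: 4, 6, and 1 comments out of 11 log entries.
    for label, count in [("Qual. Idea", 4), ("Org. Struct.", 6), ("Diction", 1)]:
        percent, bar = display_row(count, 11)
        print(f"{label:<14}{count:>3}{percent:>5}  {bar}")
```

Under these assumptions, a category mentioned in 4 of 11 entries is displayed as 36 percent with seven asterisks. Dropping the remainder keeps the bar from overstating a category's share.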

[Figure 5. Category of comments assigned at each score level.
The figure has two panels: Monitored Condition (essay score 4,
11 entries; score 3, 144 entries; score 2, 200 entries; score 1,
53 entries) and Unmonitored Condition (essay score 4, 41 entries;
score 3, 138 entries; score 2, 173 entries; score 1, 56 entries).
For each score level it lists the number and percentage of
positive and negative responses in each category (Qual. Idea,
Focus, Developmt., Org. Struct., Style-Tone, Approach, Diction,
Sent. Struc., Mech. Prob., Sent. Error, Usage, Dial. ESL,
Punct. Caps, Spelling, Length, Handwriting, Writr. Role, Overall,
Other), with a visual display in which each asterisk represents
5 percent.]

Reader 3B's taped comment about paper 072 demonstrated
the evaluation characteristic of 3 papers: "It's not a bad
paper. It's well developed. There's a rational argument.
I would give this paper a 3. It lacks polish sufficiently
to prevent any higher grade, but it is more impressive than
many other papers." In addition, the logs of Reader 2B
expressed concerns common to other readers of 3-level
papers.  For  example,  he  noted  about  paper  058,  "Good 
beginning,  text  related  well  to  topic,  some  awkwardness"; 
likewise,  he  wrote  in  reference  to  paper  093,  "Good 
development  of  thesis,  good  supporting  details — minor 
transitional  problems  and  errors."  Still  another  example 
of  the  range  of  comments  reflective  of  3-level  responses 
appears  in  Reader  2D's  notations  about  paper  041:  "Fairly 
solid  writing;  content  is  good  but  not  great,  conclusion 
adequate,  some  errors  in  language."  As  can  be  seen,  then, 
comments  about  3-level  papers  typically  acknowledged 
strengths  in  rhetorical  areas  and,  at  the  same  time, 
weaknesses  in  language  skills. 

Not  surprisingly,  papers  given  scores  of  2  reflected 
many  more  weaknesses  than  did  upper-half  papers.  As  Figure 
5  shows,  readers  made  some  positive  comments  regarding  the 
rhetorical  elements  of  content,  focus,  development,  and 
organization.  However,  their  comments  about  these  elements 
were  much  more  likely  to  be  negative  ones,  and  an  even 
greater  number  of  negative  remarks  focused  on  problems  with 
sentence  structure,  mechanical  problems,  and  usage  errors 
in  particular.  Spelling  errors  were  cited — both  on  the 
tapes  and  in  the  logs — but  unlike  some  studies,  such  as 
Grobe's (1981), in which spelling, together with length,
was one of the most commonly noted elements, the elements
of development, sentence structure, mechanical problems,
and usage received the largest percentage of responses.

The comments of Reader 1B about paper 053 were typical
of the responses made for 2-level papers: "Assertions
repeated rather than developed and supported. Fundamental
errors in spelling and grammar." Reader 2B's response to
paper 057 was also representative: "Sentences illogical,
thesis barely relates to topic, lack of detail."

Readers often responded negatively to the quality of
thought, to a shallowness of content that sometimes
characterized 2-level papers. Reacting to one student's
statement that schools sometimes "choose any person off the
street to come in and teach a class," Reader 3A noted on
his tape, "That kind of extreme, simple-minded statement
keeps it out of the upper half for me." Thus, 2-level
papers were often perceived as pedestrian or mechanical.
In one instance, Reader 3D speculated about the probable
cause for such a mechanical quality:

This  is  a  2  .  .  .  because  of  the  lack  of  detail. 
The introduction was terrific. I wish it had
followed  through  with  the  detail,  with  more 
examples.  Yes,  there  were  a  couple  of  comma 
splices,  and  I  don't  worry  too  much  about  that. 
But  it  does  indicate  [the  student]  was  trying  to 
hurry,  trying  to  finish  the  exam,  get  it  over 
with,  so  that  he  or  she  could  get  on  to  whatever 
is  next. 

The negative comments that were manifested in readers'
responses to 2-level papers predominated at the 1-score
level. Occasionally, some positive notes appeared, as in
Reader 1B's observations about paper 052, "Too many
fundamental  errors  in  English — punctuation,  spelling, 
fragments.  But  details  are  good"  [italics  added]. 
Similarly,  Reader  3C  commented  about  the  "coherent 
introduction"  of  paper  112,  and  she  noted  in  another 
instance  that  the  student's  ideas  were  good  but  that  he 
or  she  simply  had  not  yet  mastered  English  sentences. 
Thus, even though positive comments about 1-level papers
were  rare,  a  few  occurred  under  both  scoring  conditions. 
In  this  respect,  the  holistic  scorers  of  this  study 
differed  from  those  of  Haswell's  (1988),  whose  across-the- 
board  agreement  as  to  bottom  papers  caused  him  concern 
about  the  stereotyping  and  the  oversimplification  such 
agreement  implied.  Noting,  for  example,  that  "the  error- 
ridden  and  unstylish  surface  of  bottom  writing  glares, 
shields  the  depths  where  the  complexities  are"  (p.  311), 
Haswell  argues  that  teachers  can  agree  on  the  worst  student 
writing  because  they  have  simplified  its  characteristics. 
Contrary  to  Haswell's  (1988)  finding,  the  readers  of 
this  study  responded,  as  Figure  5  suggests,  to  a  variety 
of problems in 1-level papers. The comment of Reader 1B
about  paper  064  reflects  this  varied  response:  "Poor 
logic,  bad  grammar.  Poor  introduction.  Paper  has  little 
content,  much  confusing  repetition  of  phrases."  Similarly, 
Reader  2D's  notations  about  paper  074  indicate  a 
comprehensive assessment that refutes Haswell's assertion
of oversimplification: "Errors in grammar, punctuation,
and usage qualify it for a 1; logic, however, is even more
serious problem, development inadequate." Thus, even
though the 1-level papers received 1's primarily because
of  grammatical  and  mechanical  errors,  the  readers  seemed 
alert  to  rhetorical  qualities — or  the  absence  thereof — in 
these  papers . 

With  the  1-papers,  in  particular — the  score  of  which 
clearly  failed  a  student — the  question  arises  as  to  how 
conscious  readers  remained  concerning  the  consequences  of 
their  scoring  actions.  Certainly,  an  awareness  of  the 
writer  exerted  a  varying  impact  on  all  the  study 
participants.  For  example,  three  scorers  admitted  on  item 
51  of  the  questionnaire  that  their  perception  of  the  writer 
almost  always  affected  their  scoring;  Table  Leader  1  wrote 
that  the  voice  of  the  writer  often  influenced  her.  Five 
other  scorers  indicated  that  their  perception  of  the  writer 
"often"  or  at  least  "sometimes"  affected  their  judgments; 
the  remaining  nine  stated  that  it  "seldom"  or  "never"  did. 

Despite the varying impact that the scorers' awareness
of  the  writer  had  on  their  evaluations,  the  participants 
clearly  seemed  to  distinguish  between  the  responsibility 
of  their  task — namely,  the  assigning  of  a  score — and  the 
consequences  that  the  score  would  have  for  a  student.  In 
fact,  in  many  holistic  scoring  sessions,  the  chief  reader's 
initial procedural comments often urge readers to make that
exact  distinction.  On  item  43  of  the  questionnaire,  16  of 
the  17  participants  answered  that  they  "always"  or  "almost 
always"  were  able  to  separate  their  scoring  assignment 
from  the  implication  of  the  score  for  the  student.  (See 
Table  5.)  Only  one  reader  answered  that  just  "sometimes" 
could  he  make  that  distinction.  In  addition,  when  the  13 
participants  who  often  served  as  referees  were  asked  the 
additional  question  of  whether  they  could  separate  their 
refereeing  decisions  from  the  consequences,  12  indicated 
that  they  "always"  or  "almost  always"  could.  The  same 
reader  cited  above  said  that  he  "never"  could.  His 
different  viewpoint  is  understandable  in  that  he  was  a 
reader  who  talked  directly  to  the  students  on  the  tapes, 
and  he  saw  scores  in  terms  of  reward  and  punishment. 
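
The percentages reported alongside the questionnaire counts follow directly from the number of respondents. As a minimal sketch, the arithmetic can be checked as shown below; the helper name `share` is hypothetical, and rounding to the nearest whole percent is an assumption that appears consistent with the values reported for the 17 participants.

```python
def share(count, respondents=17):
    """Format a response count as 'count (percent%)'.

    Assumption: percentages are rounded to the nearest whole percent.
    The default of 17 respondents matches the number of study participants.
    """
    if respondents <= 0 or not 0 <= count <= respondents:
        raise ValueError("count must lie between 0 and respondents")
    percent = int(100 * count / respondents + 0.5)  # round half up
    return f"{count} ({percent}%)"


if __name__ == "__main__":
    # Item 43: 16 participants answered "always" or "almost always";
    # 1 answered "sometimes".
    print(share(16))  # 16 (94%)
    print(share(1))   # 1 (6%)
```

Because each participant accounts for roughly 6 percentage points in a group of 17, a shift of one or two responses changes the reported percentages noticeably.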

TABLE 5
Results of the Questionnaire, Part III

Questionnaire                   Always/         Often/      Seldom/
Item No.                        Almost Always   Sometimes   Never      Other

31) Endorse evaluation          14 (82%)        2 (12%)     --         1 (6%)
    of written products
32) Use timed writings          4 (24%)         11 (65%)    1 (6%)     1 (6%)
    in own classes
33) Use holistic scoring        5 (29%)         9 (53%)     1 (6%)     2 (12%)
    to assess classroom
    papers
34) Believe in scoring          13 (76%)        4 (24%)
    papers as a whole
36) Have difficulty in          --              1 (6%)      16 (94%)
    adhering to group
    standards
37) Have expectations           8 (47%)         6 (35%)     3 (18%)
    of what a CLAST
    paper should look
    like
43) Can separate scoring        16 (94%)        1 (6%)
    task from consequences
    for the student
45) Feel pressured by           1 (6%)          7 (41%)     8 (47%)    1 (6%)
    the speed of other
    scorers
46) Physical comfort            2 (12%)         9 (53%)     6 (35%)
    affects scoring
47) Feel comfortable            13 (76%)        2 (12%)     2 (12%)
    with the standards
    for CLAST
58) When refereeing             11 (92%)*       --          1 (8%)
    papers, can separate
    scoring task from
    consequences for
    the student

*Only the 12 participants who referee responded.

Overall, the responses suggest that the scorers of
this study were willing to suspend their own standards
in support of the group's. Despite the readers' best
intentions to observe the standards, and despite their
sincere efforts to do so, the logs and tapes indicate
that evaluating essays holistically is a complicated task.
Often the essays did not exactly fit the operational
definitions, nor did they always match the rangefinders.
Moreover, each score level was broad, comprising a range
of possibilities; a "high" 2 could substantially differ
from a 2 that was "looking down." Thus, readers had to
balance  strengths  against  weaknesses  and  determine  quickly 
which  qualities — whether  negative  or  positive — predominated 
in  the  final  impression  an  essay  made. 

Patterns  Among  Readers 

Question 5: Do any common patterns appear in the
scorers'  written  responses  to  the  essays,  or  do  their 
comments  underscore  the  individuality  of  each  reader's 
transaction  with  the  text?  Do  readers'  holistic  judgments, 
as  shown  by  their  written  responses,  correspond  to  the 
writing  features  they  rate  as  important  on  the 
questionnaire? 

In  answer  to  the  question  of  whether  patterns  occurred 
in  the  responses  of  individual  readers,  the  comments  which 
readers  made  in  the  logs  and  on  tapes  were  examined  under 
both the monitored and unmonitored conditions. Several
patterns appeared.

Without  exception,  all  readers  commented  frequently  on 
the  extent  of  development  reflected  by  the  essays  overall. 
Repeatedly, such comments as "thin on development," "not
enough development for a 3," or "good supporting details"
appeared in the logs. Similarly, the readers often
responded  to  the  quality  of  sentence  structure.  Their 
notations  included  such  phrases  as  "awkward  sentence 
structure,"  "clumsy  sentences,"  "some  sentences  confusing," 
"syntax  errors,"  or  "syntactic  sophistication."  To  varying 
degrees,  all  readers  commented  on  the  presence  of  errors  in 
some essays, either by citing the specific mistakes, as in
"homonyms," or by labeling problems with some umbrella
term, such as "severe language problems" or "needs proofing
for errors."

The  taped  protocols  made  by  the  four  special  readers 
revealed  similar  concerns  about  development  and  sentence 
structure.  Reader  3D's  observation  that  "It's  such  a 
blanket  statement — I  really  wish  I  could  see  some  details 
here"  typified  the  special  readers'  concern  with 
development.  Most  of  the  special  readers'  comments 
addressed  the  lack  of  sufficient  or  in-depth  development, 
although  occasionally,  a  reader  would  make  a  positive 
remark,  as  when  Reader  3C  noted,  "I  liked  the  concrete 
detail  of  conversation  provided  in  that  paragraph."  Reader 
3C  also  called  attention  to  the  strengths,  as  well  as  to 
the  weaknesses,  of  particular  sentences.  For  example, 
after  reading  a  sentence  with  strong  parallelism,  she 
observed,  "Beautiful  sentence  there,"  and  she  commented 
frequently  on  varied  sentence  structure.  Reader  3B  likewise 
noted  individual  sentences,  as  reflected  in  her  remark, 
"That  sentence  improved  toward  the  end  and  got  rather 
nice."  More  often,  the  four  special  readers  commented  on 
the  negatives — on  the  awkwardness,  confusion,  tangled 
structure,  and  lack  of  flow  reflected  by  some  sentences. 
For  all  12  readers,  then,  as  for  the  readers  in  Sweedler- 
Brown's  study  (1985),  development  and  sentence  structure 
appeared  as  clear  and  consistent  concerns. 

At  the  same  time,  the  logs  and  the  tapes  revealed 
the individuality of each reader. Although the readers
commented  on  a  diverse  multitude  of  features,  certain 
recurring  themes  appeared  in  the  responses  of  each  reader, 
suggesting  that  readers  brought  their  own  lenses  or  frames 
of  reference  to  each  essay.  Brief  portraits  of  each 
reader's  logs  or  tapes  will  illustrate  these  individual 
concerns.

Reader  1A  commented  often  on  the  organization  and 
structure of a paper, using the term "then/now
organization," a response unique to her. Several times
she  noted  the  clarity  of  a  thesis  and  the  effectiveness  of 
transitions,  and  she  called  attention  to  the  emphasis 
appearing  at  the  end  of  given  essays.  For  this  reader  the 
quality  of  content  seemed  especially  important;  she  noted 
problems  in  logic  and  commented  favorably  when  "a  reasoned 
argument"  occurred,  when  "good  information"  was  presented, 
or  when  a  paper  reflected  "sophistication  of  thought."  Her 
references  to  errors  were  limited  to  such  occasional 
notations  as  "a  few  errors,"  "problems  with  expression," 
"not  literate  enough"  or  "ungrammatical  phrasing."  Just  as 
she  was  apt  to  note  approvingly  if  sentences  were  balanced 
and  sophisticated,  she  also — more  than  any  other  reader — 
disapproved  of  the  use  of  passive  voice. 

For Reader 1B, organization, together with supporting
details, was critical. Such phrases as "organization
acceptable," "organization okay," or "organization needs
improvement" dominated his logs. Not only did he respond
to  the  quality  of  introductions,  but  he  also  took  note  of 
the  quality  of  conclusions  with  such  remarks  as  "weak 
conclusion"  or  "conclusion  trails  off."  Occasional 
references  to  the  thesis  or  to  "cliched  ideas"  also 
appeared,  as  did  comments  on  paragraph  unity  or  coherence. 
He  remarked  on  punctuation  errors  but  otherwise  tended  to 
classify  problems  simply  as  "fundamental  errors"  or 
"careless  errors." 

The  logs  of  Reader  1C,  in  contrast,  rarely  contained 
any  references  to  organization.  Rather,  with  such  phrases 
as  "simplistic  thought,"  she  commented  often  on  content  and 
referred  frequently  to  the  need  for  connectives  between 
sentences and paragraphs. Her responses were tailored to
the  particular  texts  in  that  she  cited  specific  errors, 
such  as  "past  tense  of  verbs,"  "no  articles,"  or  "agreement 
errors,"  and  she  virtually  never  grouped  errors  overall. 
This  reader  was  especially  concerned  with  tone,  as  she 
commented  several  times  on  "lapses  [of]  informality"  or, 
using  an  expression  unique  to  her  logs,  referred  to  essays 
that  needed  to  have  a  more  "scholarly  tone." 

Like the other readers, Reader 1D responded often to
development, syntax, and structure. Although he identified
specific errors occasionally, he primarily referred to them
simply as "mechanical problems." He made several
observations about organization but, unlike Reader 1B, rarely
referred to introductions or to conclusions; similarly, he
seldom  mentioned  the  thesis  of  a  paper.  However,  content 
was  important  to  him,  as  he  wrote  "extremely  superficial 
content"  several  times  and  noted  "macro-level  logic 
problems"  in  a  few  entries.  His  logs  were  especially 
distinctive  in  his  strong  responses  to  diction  and  style. 
Such  comments  as  "poor  diction,"  "fairly  strong  diction," 
or  "sophisticated  diction  and  construction"  were  sprinkled 
throughout  his  accounts.  Similarly,  he  wrote  of  "smooth- 
flowing style," "breezy, creative style," and "engaging" or
"plodding" styles. Of all the scorers at Table 1, Reader
1D was the only reader to respond frequently to the
elements  of  diction  and  style. 

At  Table  2,  however,  style  was  also  significant  to 
Reader  2A.  Its  importance  was  revealed  in  such  frequent 
comments  as  "style  not  distinguished,"  "perfunctory  style," 
"awkward  style  hurts,"  or,  conversely,  in  a  lengthy 
approbation,  "The  vivid  style  with  concrete  images  provides 
a  good  portrait."  References  to  diction,  coherence,  and 
paragraph  structure  appeared  in  her  logs  as  well;  however, 
the  frequency  with  which  she  wrote  of  content  underscored 
its  particular  effect  on  her  responses.  Repeatedly,  Reader 
2A  commented  approvingly,  "Content  a  plus"  or  "very  fine 
content,"  or  she  wrote  negatively,  "Content  is  pedestrian" 
or  "Content  is  not  always  coherent."  Clearly  for  Reader 
2A, as for Readers 1A, 1C, and 1D, the quality of ideas was
integral to the evaluation. In these readers' concern with
content,  they  resembled  the  scorers  of  studies  by 
Diederich  et  al.  (1961),  Freedman  (1979),  and  Breland  and 
Jones  (1984). 

For  Reader  2B,  content  seemed  far  less  significant. 
Although  his  responses  covered  a  wide  range  of  categories, 
his logs reflected special concern with diction,
organization, and focus. Such phrases as "good word choice,"
"illogical  word  choice,"  or  "high-level  vocabulary"  were 
scattered  throughout  his  commentaries,  as  were  his  remarks, 
"fair  organization,"  "poor  organization,"  "Conclusion 
introduces  new  information,"  or  "no  organization."  His 
logs  from  the  unmonitored  condition  reflected  a  particular 
awareness  of  focus,  as  several  times  he  wrote,  "Thesis  not 
tied  to  topic,"  "loses  sight  of  topic,"  or  "second  page 
unrelated  to  thesis";  to  a  lesser  extent,  similar 
responses, such as "no thesis, rambles," appeared in his
monitored logs.

The  logs  of  Reader  2C  also  reflected  a  concern  for 
focus,  as  she  commented  approvingly  on  those  writers  who 
focused  their  topics  tightly,  rather  than  writing  in  more 
general  terms.  For  this  reader  content  was  important,  too, 
in that such comments as "not really new ideas,"
"questionable logic," or "good ideas" dotted her records. She, like
Readers 1D and 2B, also took note of word choice, and
frequent remarks such as "misuses words," "abstract
language," "good image," or simply the word "things" in
quotes  appeared  in  her  logs.  For  this  reader  organization 
was  rarely  an  issue.  However,  like  Reader  1C,  she  labeled 
specific  errors,  citing  explicit  occurrences  of  fragments, 
homonyms, and verb endings.

Reader  2D  clustered  his  references  to  errors  under  the 
umbrella  term  of  "language  skills"  although  he  often 
discussed  the  "clumsy  style,"  the  "unremarkable  style,"  or 
the  "crisp  writing"  that  characterized  some  essays.  He 
frequently  referred  to  the  content  of  essays,  as  well  as 
to  their  introductions  and  conclusions.  Such  phrases  as 
"content  weak,  illogical,"  "content  mediocre,"  or 
"excellent,  content,  interesting"  permeated  his  records. 
Equally pervasive were his references to "competent,"
"tedious," or "superficial" introductions and to
"pedestrian," "repetitious," or good conclusions that even
"expand[ed]  the  subject." 

As  can  be  seen,  distinctive  patterns  appeared  in 
the logs of all readers. The patterns crossed boundaries
of  gender  and  of  instructional  level,  revealing  the 
individuality  of  each  scorer's  response.  To  be  certain, 
the  specific  nature  of  these  comments  should  not  be 
overemphasized,  for  as  Figures  B-2  and  B-3  in  Appendix  B 
indicate,  a  core  of  common  responses  underlies  their 
evaluations.  Nevertheless,  these  portraits  suggest  that 
each  reader  also  brought  an  individual  perspective  or  lens 
through which to view the student essays.

The  individuality  of  each  reader's  response  to 
the  essays  was  revealed  in  the  taped  protocols  of  the 
special  scorers  as  well.  Like  Reader  2D  and  several  other 
regular  readers,  Reader  3B  commented  on  the  strengths  and 
weaknesses  both  of  introductions  and  of  conclusions. 
Although  she  found  the  presence  of  titles  "irrelevant,"  she 
repeatedly  approved  of  good  thesis  statements  and  objected 
to  thesis  statements  that  were  too  vague  or  that  obscured 
what  the  writer  was  talking  about.  Like  Reader  1C,  she 
remarked  on  the  need  for  transitions  in  several  papers  and 
approved  the  use  of  anecdotes  as  a  means  to  support  some 
assertions.  Echoing  Reader  2C's  dislike  of  the  word 
"things,"  as  well  as  other  vague  language,  Reader  3B  called 
attention  to  the  effective  imagery  in  one  student's  use  of 
the  phrase  "domino  effect  of  awareness."  Her  comments  on 
handwriting  dealt  only  with  her  occasional  struggle  to 
decipher  certain  words;  the  struggle  seemed  part  of  her 
effort  to  make  meaning  of  the  essays,  as  revealed  in  her 
references  to  "faulty  logic,"  in  her  interpretations  of 
what  a  given  writer  must  have  meant,  or  in  the  answers  she 
gave  to  her  own  questions,  "What  helps?  Technology,  I 
suppose."

Like  Reader  3B,  Reader  3C  was  also  concerned  with 
introductions  and  conclusions.  She  commented  frequently  on 
the  need  for  transitions  and  noted  when  the  vocabulary  was 
good or when a better word was needed. Her tapes, in
particular,  reflected  the  attempt  of  a  reader  to  follow 
various writers' explanations and to understand what was
being  said.  For  example,  referring  to  one  paper  on 
oceanography,  she  expressed  trouble  with  the  idea  of 
cameras  and  bells  studying  at  ocean  depths;  later,  she 
observed,  "I'm  not  following  the  logic  in  that  paragraph — 
maybe  that's  the  problem — it  really  doesn't  seem  to  say  as 
much  as  at  first  I  thought."  Her  references  to  logic  were 
frequent,  as  were  her  indirect  allusions  to  focus:  On  one 
occasion,  she  tried  to  determine  why  a  paper  was  "rough 
reading"  and  concluded  that  her  problems  with  it  arose 
because  it  failed  to  adhere  to  its  initial  generalization. 
In  her  attempts  to  make  meaning  of  the  essays,  Reader  3C, 
like  Reader  3B,  resembled  the  placement  scorers  cited  in 
the  study  of  Barritt  et  al.  (1986). 

For  Reader  3A  dramatic  scenes  were  preferable  to  more 
general,  abstract  introductions;  similarly,  he  noted  that 
he  liked  to  "look  at  the  drama — how  people  accomplish  drama 
in  conclusions."  Admitting  that  length  and  handwriting 
were  factors  in  his  responses,  he  stressed  the  importance 
of  style.  He  expressed  irritation  over  the  he/she 
indecision  about  gender,  calling  it  "ungraceful,"  and  he 
commented  often  on  the  "sophistication  of  diction,"  noting 
once  with  irritation,  "an  ill  phrase,  a  vile  phrase." 

Like  Readers  3B  and  3C,  he  remarked  on  the  presence  or 
lack of logic in several essays: "This person is just not
doing  very  clear  thinking.  I'm  beginning  to  think  it's 
myself,  but  I  think  it's  the  student."  Such  phrases  as 
"really  silly  thinking"  or  "simplistic  logic"  were 
sprinkled  throughout  his  tapes.  At  the  same  time,  he 
explained  that  he  did  not  "struggle  with  the  progression  of 
[a]  person's  logic"  as  did  so  many  other  readers;  rather, 
he  wanted  to  derive  coherence  from  the  overall  flow  of  an 
essay. 

Reader  3A  spoke  of  rewarding  or  penalizing  essays  for 
certain  qualities  and  with  some  essays  wondered  aloud  what 
score  the  second  reader  would  be  likely  to  give.  The  most 
distinctive  trait  of  Reader  3A,  however,  was  his  tendency 
to  respond  to  whatever  students  discussed  in  essays  in 
terms  of  his  own  personal  experiences--whether  it  be 
grocery  stores,  the  dentist's  office,  or  an  old  Woody  Allen 
serial. 

Reader  3D  shared  a  number  of  response  traits  with 
Reader  3A  even  though  both  taught  at  different  types  of 
institutions  in  different  towns.  Like  Reader  3A,  Reader  3D 
talked  about  rewarding  or  penalizing  essays,  a  concept 
seldom  mentioned  by  the  other  special  readers.  Reader  3D, 
also,  like  Reader  3A,  occasionally  speculated  as  to  what  a 
second  reader's  score  might  be,  and  he,  too,  responded 
often  to  diction.  His  tapes  were  dominated  by  such 
comments  as  "I  like  the  diction  level  here.  I  like  the 
trouble this kid went through. I really appreciate it,"
or,  conversely,  "Some  of  the  vocabulary  is  a  little 
awkward, a little strange at times. . . I'm waiting for a
little  more  careful  use  of  the  language  here."  Like 
several  of  the  other  readers,  Reader  3D  was  concerned  with 
logic  and  meaning;  occasionally,  in  referring  to  jumbled 
ideas,  he  used  the  term  semantic  abbreviation,  which  he 
attributed  to  Collins  and  Williamson's  work  (1984),  to 
indicate  that  the  student  had  not  said  enough. 

Unique  to  Reader  3D  were  his  concerns  both  with 
text  and  with  revision.  Openly  acknowledging  that  he  was 
affected  by  handwriting,  he  observed  that  handwriting 
contributed  to  the  visual  impression  of  text  that  each 
essay  made;  he  was  similarly  impressed  by  titles,  and 
he  took  note  of  indentations  of  paragraphs  and  the 
straightness  of  margins.  Saying  that  "part  of  the  sense  of 
text  in  writing  must  be  visual,"  Reader  3D  explained  that 
he  wanted  to  give  each  student  writer  "the  opportunity  to 
demonstrate  that  he  has  some  sense  of  vision — some  sense  of 
the  visual  or  visible  quality  of  what  a  piece  of  writing 
needs  to  be."  Another  part  of  that  visual  impression 
entailed  signs  of  revision;  for  Reader  3D,  erasures  and 
cross-outs  signaled  that  some  thinking  was  going  on, 
along  with  the  writing.  So  important  was  revision  to  this 
reader  that  he  used  the  terms  "bleeder"  and  "barfer"  to 
distinguish  between  those  writers  who  agonized  over  each 
word  of  each  sentence  and  those  who  wrote  first  and  then 
examined their work afterward.

Therefore,  as  can  be  seen,  the  tapes  confirmed  the 
individuality  of  perspective  which  each  reader  brought  to 
the scoring. Although all the readers shared some concerns,
they each had individual patterns of response to
the  essays.  Such  individuality  is  not  new,  for  a  number  of 
studies  in  the  literature  review,  including  the  recent 
works  by  Martin  (1987)  and  by  Huot  (1988),  have  underscored 
the  individuality  of  each  reader's  response. 

The  Writer  Behind  the  Paper 

The  individuality  extended  to  several  readers' 
envisioning  of  or  interaction  with  the  writer  behind  the 
paper.  The  effect  of  readers'  perceptions  of  the  writer 
behind  the  essay  has  been  the  subject  of  recent  concern  for 
the  researchers  Barritt,  Stock,  and  Clark  (1986)  and 
Sullivan  (1986);  as  noted  in  the  previous  discussion  of 
Question  4  and  as  shown  in  the  logs,  several  readers  in 
this  study  felt  that  their  perceptions  of  the  writer  behind 
the  paper  affected  their  responses.  For  example,  Reader 
2A's  responses  often  showed  sympathy  for  students'  "rough 
draft"  performance,  and  Reader  2D  wrote  approvingly  of 
several students' ability to write knowledgeably about
their  subjects.  Most  of  the  readers'  responses  to  writers 
were  indirectly  expressed.  Only  in  the  logs  of  Readers  1A 
and  2C  did  the  readers  respond  directly  to  the  writers 
behind  the  essays.  Reader  1A  occasionally  reacted  to  the 
writer's ideas, with such comments as "What about those who
are  not  of  a  Judaeo-Christian  background?"  or  "Help,  the 
specialists  are  winning!  So  much  for  the  generalists  and 
the  well-rounded  individual  with  the  curiosity  to  explore 
further  than  our  own  backyard."  Reader  2C  also  reacted 
directly  to  some  content  and  to  the  writer  of  those  ideas. 
She  wrote  angrily,  "Where  does  he  get  ideas  like  teachers 
are  just  anyone  off  the  street?!"  With  an  apology  for 
commenting  personally  about  one  student's  argument,  she 
wrote  about  another  paper,  "[The  writer  is]  a  little 
uncompassionate  to  students  who  need  to  work  and  whose 
parents  are  already  being  responsible.  May  also  need  to 
consider  if  more  will  really  benefit  students  who  already 
want less!"

Thus,  the  logs  reflected  to  a  limited  extent  several 
readers'  awareness  of  the  writers  behind  the  papers  and  an 
attempt to react to these students' ideas.

Such  interaction  or  engagement  with  the  writer  is 
more  readily  apparent  in  the  taped  protocols  of  the  special 
readers.  The  accounts  of  Reader  3D,  in  particular, 
reflected a running dialogue with the writers. Not only
did he frequently interject such comments as "true,"
"interesting,"  or  "okay,"  but  he  also  responded  directly  to 
the  student:  "Well,  show  me  what  you  mean  by  better,  kid" 
or  "I'm  waiting  for  something  more  definite,  young  lady, 
young man." Similarly, as the three quotations below
indicate,  Reader  3D  often  stepped  back  from  the  papers  to 
talk about the writers:

1.  Nice  basic  piece  here.  Almost  reads  like  a 
summary  of  something  the  kid  studied  in  some 
detail  and  tends  to  care  about,  I  think. 

2.  I'm not sure where this is headed at this
point.  This  kid  is  trying  hard,  though.  It's 
obvious  he  is  trying  hard. 

3.  My  impression  at  this  point  is  that  this  is  a 
kid  who  is  struggling  to  write — to  take  the 
inner  speech  that  is  working  up  inside  the  head 
and  trying  to  put  it  down  in  a  manner  that  is 
acceptable. 

Throughout  his  tapes  Reader  3D  engaged  in  a  dialogue 
with, as well as a commentary about, the writers themselves.
He later observed that the process of verbalizing
his comments on tape had made him more humane in his
responses.

Like  Reader  3D,  Reader  3A  also  talked  constantly  to  and 
about the writers. He argued with some of the writers'
notions,  declaring,  "That's  not  true.  I  don't  believe  it," 
or  questioning  the  source  of  another  writer's  statistics. 
He,  too,  stepped  back  on  occasion  to  make  observations 
about  some  writers:  "Interesting  small  subject,  but 
apparently  not  a  good  choice  for  this  writer  'cause  here 
she  doesn't  have  anything  to  say  about  it." 

Thus,  in  the  taped  protocols  of  both  men  there  is  an 
ongoing  display  of  interaction  with  the  writer.  A  similar 
awareness,  albeit  to  a  lesser  extent,  appeared  in  the 
protocols  of  the  two  women  readers.  Occasionally,  Reader 
3B  responded  directly  to  the  writer's  statements  with  a 
question or a simple phrase, such as "Oh, joy" to the idea
of a television in a dentist's office. More often, however,
her comments were about the writer, as revealed in
her  remarks,  "The  person  is  obviously  talking  about 
something  she  knows  about,  very  interesting"  and  "He  is 
prejudiced  clearly,  and  makes  assertions  that  he  does  not 
adequately  support,"  or  in  her  observation,  "This  person 
has  obviously  never  tried  to  teach  a  class  outside  on  a 
pretty spring day."

The  formality  conveyed  by  Reader  3B's  use  of  the 
term  "person"  rather  than  3D's  "kid"  appeared  in  Reader 
3C's  tapes  as  well.  Only  rarely  did  Reader  3C  respond 
directly  to  the  content  with  such  comments  as  "I  didn't 
know  that"  or  "Okay,  if  that's  indeed  a  big  benefit."  Her 
remarks  about  the  writers  were  infrequent,  too,  limited  to 
such  phrases  as  "I  liked  some  of  the  reasoning  this  person 
gave. . . ."

Just  as  the  logs  revealed  among  the  eight  regular 
readers  varying  degrees  of  awareness  of  the  writers  behind 
the  papers,  so,  too,  did  the  taped  protocols  indicate  a 
spectrum  of  response  among  the  four  special  readers. 
Whereas  the  two  men  seemed  directly  and  frequently  involved 
with  the  writers,  the  women's  interactions  were  less 
frequent and more formal.

The  Causes  of  Errors  and  Their  Remedies 

Like  some  of  the  regular  readers,  the  special  readers 
noted specific errors as they talked their way through
the  papers.  In  addition,  they  attempted  to  explore  the 
reason  for  the  difficulties.  The  logs  did  not  fully 
address  this  concern,  as  the  eight  regular  readers  had 
neither  the  time  nor  the  space  to  explore  the  causes  of 
errors;  only  some  tangential  references  to  probable  causes 
occurred with Readers 1B and 1C's remarks about careless
errors  and  with  Reader  2A's  frequent  emphasis  on  the  papers 
as  rough  drafts.  The  special  readers,  however,  often 
remarked  on  why  a  particular  error  might  have  occurred, 
frequently  attributing  such  errors  to  haste.  Reader  3C's 
comment  was  typical:  "I  think  that  this  person  just  writes 
quickly  and  makes  careless  errors.  At  least,  that's  what 
I  chalked  it  up  to  right  now,  because  the  content  seems 
to  be  good,  and  the  sentences  seem  to  be  well  done." 
Likewise,  Reader  3D  speculated  that  a  student  probably 
thought  "chances  instead  of  changes.  The  writer  didn't 
even  realize  he  or  she  was  doing  that  in  the  haste  of  the 
situation."

The  special  readers  considered  other  possibilities 
as  well.  In  one  instance  Reader  3A  attributed  the 
lack of "n" on the article "an" to a revision in which the
writer failed to make all the necessary changes. Still
other  sources  of  error  cited  were  the  writer's  lack  of 
sensitivity  to  audience  and  the  interplay  of  written  and 
spoken  language,  as  shown  when  Reader  3D  observed,  "Seems 
to me that this is the person's first draft relying heavily
on speech." Often the readers were sympathetic to the
students. In referring to one paper with an audience
problem,  Reader  3D  stated,  "Makes  me  feel  bad  because  if 
this  were  intended  for  an  audience  of  draftsmen,  I'm  sure 
they'd  all  be  sitting  and  nodding  right  now  and  saying  what 
a  wonderful  piece  this  is.  But  unfortunately,  that  wasn't 
his  audience.  His  audience  is  an  English  teacher  doing  a 
test."  This  reader  noted  that  he  often  put  himself  in  the 
place  of  the  students,  remembering  his  own  written  and  oral 
examinations.

But  with  other  errors,  the  readers  expressed 
frustration  at  students'  lack  of  knowledge.  Referring 
to  an  example  of  poor  diction,  Reader  3B  commented,  "An 
upper-half  writer  shouldn't  be  making  errors  like  these — 
like installate." Similarly, Reader 3A, in expressing
irritation  over  a  lack  of  word  endings,  argued,  "[People] 
should  be  required  or  taught  not  to  abandon  the  word  before 
they  move  on  to  the  next  one  physically  with  their  hands 
and put their minds on the next word."

The readers' remarks on the causes of errors carried
over to a concern for probable remedies. Reader 3B noted
that  a  brief  comma  review  would  help  one  student,  and 
Reader  3D  speculated  that  a  10-minute  lesson  on  apostrophes 
would  benefit  another  writer.  Reader  3A  suggested  that  one 
especially  weak  writer  desperately  needed  a  one-on-one 
conference. His frustration at being unable to help such
a  writer  was  clearly  evident  in  his  probing  question  of  how 
much  good  he  could  actually  do:  "How  do  you  generalize  the 
name  of  that  error?  Then,  how  do  you  get  the  person  to 
understand  and  not  do  it  again?  You  can't  .  .  .  language 
skills  sometimes  seem  to  be  so  difficult  to  pinpoint." 

For  Reader  3A,  as  well  as  for  the  other  readers  to  a 
lesser  extent,  the  taped  evaluations  of  student  essays 
served  as  a  point  of  departure  for  general  speculations 
about  writing.  For  example,  Reader  3B  discussed  how  one 
paper  embodied  "the  very  definition  of  good  writing"; 
similarly,  Reader  3D  commented  that  for  him,  writing 
was  truly  effective  when  it  "filled  [his]  questions 
and  enabled  [him]  to  read  like  a  reader  and  not  like  a 
teacher."  Both  Readers  3D  and  3B  commented  on  the 
inappropriate  familiarity  with  the  reader  that  the  use 
of  "you"  reflected  and  on  its  frequent  occurrence  as  a 
characteristic  of  less  sophisticated  writers.  Reader  3A 
labeled  as  another  trait  of  unsophisticated  writers  an 
inability  to  deal  with  ambivalence  or  ambiguity.  Stressing 
the  need  for  using  a  "domino  effect"  whenever  editing  so 
that  one  change  generates  the  subsequent  changes  also 
needed,  Reader  3A  observed  about  one  paper:  "The  style 
is,  on  the  one  hand,  sophisticated  but,  on  the  other, 
incorrectly  done.  I  guess  all  writing  has  things  that  work 
and  things  that  don't  work."  With  frustration  he  reflected 
on the difficulty of teaching writing, especially in view
of  its  close  link  to  thinking:  "How  would  you  begin  to 
work  with  this  person's  mind?"  For  the  special  readers 
doing  the  tapes,  then,  the  evaluation  of  individual  essays 
led  to  more  general  speculations  about  writing. 

Scoring  Approaches  and  Preferences 

Readers  were  also  asked  to  speculate  about  their 
own approaches to scoring papers holistically. One open-ended
question on the questionnaire (see Part V of the
questionnaire  in  Appendix  A)  asked  readers  to  describe 
their  processes  in  holistically  scoring  papers;  these 
processes  again  underscored  the  readers'  individuality. 
For  example,  the  tapes  showed  Reader  3A  often  announcing 
an  immediate  score  based  on  what  he  described  in  his 
questionnaire  as  handwriting,  the  language  of  a  sentence, 
the  length,  and  the  presence  or  absence  of  detail.  Then  he 
continued  reading  to  determine  whether  the  initial  judgment 
would  hold.  Reader  1A,  like  several  others,  tended  to  read 
the  first  paragraph  carefully  and  make  a  tentative  judgment 
as  to  upper-half,  lower-half.  After  reading  the  remainder 
of  the  paper,  she  would  adjust  the  score  accordingly. 
Reader  3C  indicated  that  she  first  decided  to  which  half 
the paper belonged, mentally assigned a score of 4 or 2,
and scored down from there. Reader 1D wrote that he rarely
changed his mind on the upper/lower half distinction after
the first paragraph and often knew the final score by the
third  paragraph.  Reader  3B,  in  contrast,  noted  that 
sometimes  her  score  changed  two  or  three  times  in  the 
course  of  reading  a  paper.  Reader  2C  responded  that  she 
"read  the  whole  paper,  and  some  papers  just  seem  to  be  a 
particular  score."  Still  other  readers  used  a  combination 
of  methods.  As  Reader  2A  explained,  some  papers  were  easy 
to  score  whereas  others  required  a  closer  scrutiny  or, 
occasionally,  even  rereading.  Thus,  the  methods  the 
various  readers  used  underscored  their  different  response 
patterns.

Still  other  explanations  for  readers'  differing 
responses  to  the  essays  may  lie  in  the  additional  answers 
they gave to some sections of the questionnaire. Three
parts are especially applicable: (a) the readers' open-ended
descriptions of their own scoring tendencies,
(b) their ratings of their biases and preferences, and
(c) their ratings of criteria the readers considered
important in judging timed writings.

Most  of  the  readers  described  themselves  as  fair  in 
their  scoring;  the  majority  also  noted  that  they  were 
strict,  with  several  tending  toward  the  low.  Four  readers 
acknowledged  that  they  were  generous  or  charitable  with 
better  papers.  Reader  3D,  while  describing  his  scoring 
tendencies  as  fair,  noted  that  at  times  his  scoring  was 
affected  by  "the  view  of  the  student  through  his/her 
prose." Indeed, as has been discussed, the protocols
confirmed  this  reader's  awareness  of  the  writer  behind 
the  essay. 

The  readers  frankly  acknowledged  their  biases  or 
preferences by placing a plus (+) or a minus (-) beside
those  items  on  the  questionnaire  that  triggered  either  a 
strong  positive  or  a  strong  negative  response  in  their 
reading.  They  did  not  mark  items  toward  which  they  were 
neutral;  they  could  add  items  not  included. 

As  Table  6  indicates,  two-thirds  of  the  12  readers 
agreed  that  they  reacted  negatively  to  misinformation  in 
papers,  to  shallow  essays,  to  hard-to-read  handwriting, 
and to extremely short papers. Eight or nine readers
also  agreed  that  they  responded  positively  to  creative 
papers,  to  humor,  and  to  a  delightful  writer  behind  the 
essay.  But  the  individuality  of  the  readers  appeared  in 
the  mixed  response  that  most  other  categories  generated: 
For  example,  whereas  two  readers  acknowledged  responding 
negatively to rhetorical devices and positively to first-person
narratives, one reader reacted to each of these
categories in the reverse. Other readers were, presumably,
neutral about those areas. Whereas three readers liked
technical/scientific  papers,  two  did  not;  though  six 
readers  responded  negatively  to  religious  papers,  one 
liked  such  essays.  Even  the  category  of  "disagreeable 
writer"  generated  a  mixed  response,  as  two  readers  noted 
that they reacted positively to the sign of any writer
behind an essay. The taped protocols revealed other biases
on the part of special readers, ranging from one reader's dislike
of the phrase "a lot of" or the use of the pronoun "one" to
another reader's dislike of jargon and a third reader's
dislike of a formal tone.

TABLE 6
Results of Questionnaire on Biases and Preferences

Questionnaire (Part 2, Item 25)    Number of Positive Responses    Number of Negative Responses

Political papers
Social issues
Religious papers
First-person narratives
Technical papers
Literary allusions
Creative papers
Misinformation
Humor
Severe misspellings
Shallow papers
Rhetorical devices
Disagreeable writer
Illegible handwriting
Extremely short papers
Weak conclusion/introduction
Inductive papers
Sentimental papers
Delightful writer
Slang

Notes: The 17 participants checked with pluses or minuses
only those elements which triggered a strong
personal reaction, either negative or positive.

Other write-in categories included wit, irony,
perceptiveness, attacks on the test, sarcasm, and
unexamined values.

To  adjust  for  their  own  biases — of  which  readers  were 
obviously  aware — the  readers  identified  several  strategies 
they  used  most  frequently.  Two-thirds  indicated  that  they 
often  slowed  down  when  they  encountered  a  paper  that 
triggered  a  strong  personal  reaction,  and  one-third  said 
they  reread  the  paper.  Four  readers  noted  that  they 
consulted  "often"  with  the  table  leaders,  four  only 
"sometimes,"  and  the  remaining  four  "seldom"  or  "never." 
The readers were equally divided between "sometimes" and
"seldom" in their tendency to reexamine the rangefinders.
One  reader  wrote  that  he  put  the  papers  at  the  end  of  the 
stack  to  return  to  for  reexamination  after  a  break. 

Readers  were  also  asked  to  rate  the  writing  criteria 
they  considered  important  in  evaluating  timed  essays;  they 
checked  which  of  24  areas  dealing  with  rhetoric,  style, 
grammar,  and  mechanics  they  believed  to  be  "very  important" 
(4  points),  "important"  (3  points),  "somewhat  important" 
(2  points),  or  "not  very  important"  (1  point). 

In  this  study,  unlike  Breland  and  Jones's  (1984), 
correlations were not obtained between the readers'
ratings of criteria important to their evaluations and
their  actual  responses  as  reflected  through  logs  and  tapes. 
Nevertheless,  Figures  6  and  7  illustrate  the  variations  in 
importance  that  individual  readers  placed  on  the  different 
writing  criteria. 

That  creativity  occurs  as  the  least  important  feature 
in  these  ratings  is  somewhat  surprising.  However,  it  must 
be  remembered  that  the  criteria  which  readers  were  asked  to 
rate  were  the  features  they  valued  most  in  assessing  timed 
writings; moreover, the ratings were relative, ranging from
"very  important"  to  "not  very  important"  with  absolutes, 
such  as  "not  at  all  important,"  excluded.  In  this  context 
of  timed  assessment  and  relative  comparisons,  the  placement 
of  creativity  at  the  bottom  of  the  scale  (see  Figures  6 
and 7) seems less disturbing. For example, even the lowest
score on the scale (a total of 25) represents an average
rating of slightly more than 2 points from each of the 12
readers, a rating which signifies "somewhat important." Their responses may
thus  indicate  that  creativity  is  simply  not  essential  in 
timed  assessments;  that  is,  while  students  can  be — and,  in 
fact,  typically  are — rewarded  for  timed  essays  that  are 
creative,  students  whose  essays  are  otherwise  strong  and 
solid  are  not  penalized  for  their  lack  of  creativity  in  an 
assessment  situation.  That  the  criterion  of  the  writer's 
commitment  to  the  topic  ranked  similarly  low  was  perhaps 
due  to  readers'  recognition  that  assessment  topics, 
assigned  as  they  often  are,  will  not  engage  the  students 


137 


31 

^•m«j'"*ro(S'JlMM>dlHMfO!NHN(YiHrH,sloj(Yinrn 

•>* 

cn 
u 

•n\ 

CD 

T3 

cO 

CD 

u 

Jl 

^^oo^^(N(NrO'a,CNr«orororO(Nrnrornrof,ocNicNrNiro 

"n| 

rH 

(0 
T3 

X|| 

^^CNr^rocNCN'^'TrrOiHrooooocNirorocNrNjcNiHiHrHro 

"° 

•H 

> 

■H 

c 

■H 

■o| 

o)mi,^,n(N(Nron(yim'*ro(yirni'ro(vir>jf,imrocN^' 

*o 

X! 
(0 

-H 

ni 

N| 

^MnTMMm(NnN(N(ylCMMOJ>J,m(Nf»imrOMrom 

•=r 

^1 

l"IM<3,^lM(N(,l<*fOrO(,1d(*l!vl"*^"*iy)(*l(NrorOf|1 

u 

CD 
4J 
■H 

V4 

mi 

"Sl| 

^ro^ro(^CNr\icvirocNcN(^M^foro(N<Yiro'd,<NCNJCNro 

u 
Cn 

c 

■H 
4-1 
■H 
M 

<C| 

^o^^^^^^o^^cN"^,f,^CN<v^^O(v1^o^l'•^*^,^,^1,f,^f,r>^^^^J, 

■^ 

Ol| 

s 

o 

4J 

Ql 

»f  f  <*  <f  n  ro  M    1    ^trororororooom^rroromcNrocNro 

^ 

HI 

Ti 

CD 

c 

Cn 
•H 

cn 

t^i'^'j^rfnnroromtNd^nn'^^MrnntNcsmi1 

cn 
CO 

rH 

f^i'j,cNf,r)Tj,roniHroncN'*'*rocNnoorocN    1    ro  co  cn  ro 

cn 

4-1 

c 

0 

04 

< 

'*'Jn'J'<d"nivi(N<iMtN^f»j,^',*'*roro(r)rnr\)CMiNn 

ro                                                                             cn 

4-> 

4-> 

14-1 
0 

<D                                                                  c  cn 

c 

(0 

>i 

fd                                                  -P           0  (U 

M 

cd 

4-> 

H                                                                     U                1     O 

a) 

4-1 

rH 

CO 

<D   3             c  C        cn 

^ 

u 

0 

Cn                                    U                   <-\    U               ^   Q)         U 

+J 

rd 

0 

Cu 

§ 

c                               CD                 >,+>            pi-ucno 

cn 

X 

a> 

a  b 

3 

CO 

•H                                            4->                        4->   CO                         C    M    W 

CD 

a> 

BS 

g 

•  rH 

H                             -h                c/i                  -  <d  O  m  cn 

•H 

+J 

•H 

H         4J                             >H                               •GcnUDMHV-l 

u 

c  u-t 

4-> 

ox;                  s               (u+jo-pmo 

0 

0 

o 

4-> 

cn 

, 

S-i         Cn                                                  UC-HCG'dW+JV4                c 

CT 

u 

cn 

(0 

C£> 

4->         3                         4-1                    CCD4->OCDCDOM               0 

CD 

+J 

0 

CD 

c      o      cc      o              (ucou-hehoiiiw          -h 

4-> 

iw 

c 

S  v-4 

CD 

0        ,C  4->   0   0                               4->        -H  4->    C7>  tn  tr>r~l               C4-) 

CCJ 

0 

a) 

CO 

u 

U        EhC-H-hC+J               >iCWQCJH)C(l)(l)J         0    cd 

u 

e 

0) 

II 

II 

3 

(D4J+1   0  C              4-iCDO        -H^rdcn-HCfl        -HN 

<B 

c 

o 

Cn 

-rH 

> 

CD        ih   E   (J   O-H   J)              -HW         (UQCuHDQWCT-P-'H 

c 

u 

cn 

•H 

c 

•^ 

rH 

M 

4->             O    ttN     3    01    E                     >             >1-t->                                                     CfOr-l 

•H 

•H 

13 

(0 

CO 

fe 

O 

1 

a 

0) 

-P 

V-l 

cn 

^cnXI^HCOH-H^i        +JCQ)MWT3rOTirdraH4J4-J4-l 

d) 

0 

O  H 

n 

cd 

•• 

CD 

01  3  4->    CD    rO    V-i   O   g-P   <D    (0    CD  -H    3    3  -h  -H  —1  -H  —I  i-H    u-h    Cn 

+J 

En 

•rH 

s 

CD 

CD 

CD 

4-> 

uua>oi+JcE-HC«)3Mt)+Jooooo(i)cac 

■H 

1 

Cn  0 

4J 

a 

4-1 

(0 

1dOH)H)MCOOCO^Hii!U(l)>>>>>Di3U(D 

In 

c 

o 

c 

c 

a 

o 

U 

<CnQDOHUO3EnUti><;S<:<<<<C0CnUJ 

s 

o 

J 

w 

w 

<=c 

53 

138 


Most 
Important 


44*  Development/Adequate  Controlling  Idea 

43  Focus 

41  Avoidance  of  Fragments  and  Run-ons 

40  Depth  of  Thought 

39  Fluent  Sentence  Style/Avoidance  of  Tangled  Sentences/ 
Length 

38  Variety  of  Sentence  Structure 

37  Accurate  Diction 


Least 
Important 


35  ESL  Errors 

34  Tone/Usage  Errors 

33 

32  Dialect  Errors 

31  Mature  Diction/lntroduction/Conclusion 

3  D  Punctuation 


28    Commitment  of  Writer 


Spelling/Capitalization 


25    Creativity 


Note:   Each  reader  could  assign  a  maximum  of  4  points  per  criterion. 
*The  numbers  represent  total  points  assigned  by  12  readers. 

Figure  7.   Readers'  ratings  of  importance  of  criteria  in  timed 
writings . 


139 


in  the  same  way  that  topics  of  choice  for  outside 
assignments  may  do. 
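
The conversion from a Figure 7 total to a per-reader average, as in the remark that a total of 25 corresponds to roughly 2 points per reader, can be sketched as follows. This is an illustrative calculation only, not part of the study's analysis: the function and variable names are hypothetical, the totals are copied from Figure 7, and the sketch assumes that all 12 readers rated every criterion on the 1-to-4 scale.

```python
# Illustrative sketch: convert Figure 7 totals into mean ratings per reader.
# Assumption: every one of the 12 readers rated each criterion (1-4 points).
NUM_READERS = 12

# A few totals as reported in Figure 7 (points summed over the 12 readers).
figure7_totals = {
    "Development/Adequate Controlling Idea": 44,
    "Focus": 43,
    "Depth of Thought": 40,
    "Commitment of Writer": 28,
    "Creativity": 25,
}

# Verbal anchors of the questionnaire's rating scale.
LABELS = {
    4: "very important",
    3: "important",
    2: "somewhat important",
    1: "not very important",
}


def mean_rating(total, readers=NUM_READERS):
    """Average number of points each reader assigned to a criterion."""
    return total / readers


def nearest_label(total, readers=NUM_READERS):
    """Scale label closest to the mean rating for a criterion."""
    return LABELS[round(mean_rating(total, readers))]


for criterion, total in figure7_totals.items():
    print(f"{criterion}: {mean_rating(total):.2f} ({nearest_label(total)})")
```

Under these assumptions, Creativity's total of 25 gives a mean of about 2.08, which falls nearest to "somewhat important," in line with the interpretation offered in the text.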

An  informal  survey  of  the  questionnaires  in  comparison 
to  the  logs  or  tapes  confirms  the  validity  of  many  features 
which readers said they deemed to be important. Reader 1A,
for example, who commented frequently in her logs on
the thesis and on "then/now" organization, rated those
rhetorical features as very important to her. Similarly,
Readers 1D, 2B, 2C, and 3A, all of whom had commented
repeatedly  on  diction  in  their  responses  to  the  essays, 
admitted  on  the  questionnaire  to  valuing  word  choice 
highly.  Reader  1C  responded  on  the  questionnaire  that 
sentence  variety  was  very  important  to  her,  and  the 
frequency  of  her  comments  in  the  log  corroborated  its 
significance.  In  addition,  Reader  2D  showed  through  both 
his  logs  and  his  questionnaires  the  importance  that  he 
placed  on  content.  Thus,  the  features  which  the  readers 
rated  either  as  "very  important"  or  "important"  on  the 
questionnaire  were  often  the  same  features  to  which  they 
responded  in  the  essays.  In  this  respect,  the  readers  of 
this  study  differed  significantly  from  those  in  Harris' 
study  (1977)  and  in  Stach's  (1987)  in  which  the  actual 
basis  for  writing  judgments  was  far  different  from  what  the 
readers  thought  it  would  be. 

At  the  same  time,  some  discrepancies  in  this  study 
could be found. For example, Reader 1C described thesis
and focus as being very important in her writing judgments,
yet she rarely mentioned these elements in her logs.
Reader 2A, responding to an optional item on the questionnaire,
stated that topicality was crucial to her but
never addressed this element in her logs. Thus, the
questionnaires  suggested  that  more  elements  were  important 
to  readers  than  their  responses  to  the  essays  might  have 
suggested;  this  finding  is  not  surprising  in  that  the  logs 
comprised  brief  summaries  and  were  written  in  the  midst  of 
an  actual  scoring  process  when  readers  were  attempting  to 
score  substantial  numbers  of  papers. 

Conversely,  the  logs  revealed  that  some  features, 
believed  by  readers  to  be  only  "somewhat  important"  or  "not 
very  important,"  might  have  more  significance  than  the 
readers necessarily realized. For example, Reader 1B rated
the category Depth of Thought as only "somewhat important"
to him. However, in the unmonitored scoring he commented
in several instances on logic, cliched ideas, content, and
intellect. Similarly, Reader 2B, who rated Introduction,
Conclusion, and Tangled Sentences as only "somewhat
important,"  responded  to  these  particular  features  in  his 
logs.  Reader  2C,  who  had  frequently  specified  the  nature 
of  misspelling  in  her  logs  by  noting  "homonyms,"  likewise 
rated  spelling  as  only  "somewhat  important"  to  her. 

The  discrepancies  between  what  some  readers  said  was 
important to them and what they showed as being significant
may be due to several factors: First, readers may have
been  unaware  of  what  was  truly  involved  in  their  own 
scoring  judgments,  although  the  metacognitive  awareness 
reflected  by  their  other  self-reports  makes  such  an 
explanation  unlikely.  Second,  the  subjectivity  entailed  in 
interpreting  the  degrees  of  importance — especially  a 
category  such  as  "somewhat  important" — may  account  for  the 
occurrence  of  some  responses  in  the  logs.  Finally,  the 
experience  of  recording  comments  in  logs  or  on  tapes  was  a 
new  one  for  virtually  all  the  readers.  Consequently,  in 
the  process  of  an  actual  scoring,  they  may  have  been  unable 
to  note  in  either  written  or  taped  form  all  the  features 
to  which  they  were  responding.  As  Freedman  and  Calfee 
(1983)  note,  articulating  an  evaluative  response  to  a  work 
is difficult; hence, this last explanation seems most
likely. 

Summary  of  Data 

Taken together, all the data — from the logs and taped
protocols  to  the  closed  and  open-ended  questions  on  the 
questionnaire — underscore  the  individual  perspective  that 
each  reader  brought  to  the  scoring  task.  The  perspectives 
seemed  influenced  by  a  combination  of  scoring  approaches, 
particular  biases  or  preferences,  and  features  the  readers 
valued  most  in  timed  writings,  as  well  as,  undoubtedly,  by 
personal and background factors not under consideration.
(In a recent study Martin (1987) found these latter factors
to  be  especially  important.) 

These  patterns  of  response  could  not  be  attributed  to 
gender,  race,  or  instructional  level.  Rather,  the  readers 
seemed  randomly  linked  with  one  another  in  some  of  the 
writing  features  they  valued  or  in  some  of  the  approaches 
they used. But the word "some" is important to emphasize
in  this  regard,  for  as  Figure  7  illustrates,  readers  agreed 
on  the  significance  of  focus,  organization,  unity,  and 
fluent  sentence  style;  they  generally  agreed  on  the 
relative  unimportance  in  timed  essays  of  creativity,  the 
writer's commitment to the topic, and spelling or punctuation.
Thus, the individuality of readers' perspectives
on  writing  was  clearly  grounded  in  shared  beliefs  that 
undoubtedly  contributed  to  the  high  interrater  reliability 
discussed  under  questions  2  and  3. 

Nature  of  the  Monitoring 

Question  6:  What  is  the  nature  of  the  monitoring  that 
the  readers  receive  during  a  scoring  as  reported  through 
the  logs  of  chief  readers,  table  leaders,  and  readers?  Do 
the  procedures  noted  in  these  logs,  together  with  the 
protocols  of  the  special  readers,  support  the  readers' 
perceptions  of  their  own  holistic  scoring  processes  as 
noted  on  the  questionnaire? 

The nature of the monitoring in a holistic scoring
was determined through an examination of the following
data: (a) the logs of the table leaders and chief readers,
(b) their interactions with readers as reported both
through  the  readers'  scoring  logs  and  the  special  readers' 
tapes,  and  (c)  the  questionnaires  given  all  participants. 
In  addition,  the  researcher  recorded  observations  of  the 
monitored  scoring  in  progress,  the  results  of  which  were 
reported under Question 5.

Roles  and  Questionnaire  Responses  of  the  Chief  Readers 
and  Table  Leaders 

Like  the  readers,  the  five  people  responsible  for 
their  training  and  monitoring  in  the  monitored  holistic 
scoring — that is, the two chief readers and three table
leaders — came  from  universities,  community  colleges,  and 
high  schools.  For  the  unmonitored  portion  of  the  scoring, 
all  had  been  given  different  tasks  from  the  readers:  The 
table  leaders  read  the  39  original  samples  and  range- 
finders,  recording  their  comments  in  a  log.  The  chief 
readers,  who  had  been  involved  in  the  original  selections 
of  these  samples  two  years  previously,  read  the  112  papers 
initially  chosen  for  the  study,  recording  their  comments 
about  each  paper  in  a  log. 

As  in  the  readers'  case,  the  chief  readers'  and  table 
leaders' questionnaires reflected some individuality of
response  to  various  writing  features.  For  example,  on  the 
questionnaire  all  five  rated  an  adequate  controlling 
idea  and  development  as  "very  important"  in  their  writing 
judgments;  similarly,  they  judged  sentence  fluency,  variety 
of sentence structure, and accurate diction as significant
also. However, as Figures 8 and 9 indicate, the chief
readers and table leaders disagreed about the importance of
fragments and run-ons, of dialect/ESL errors, and of a
conclusion. 
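
The disagreement noted above can be checked against the points reported in Figure 8. The following Python sketch is an illustration only and is not part of the original study. The names RATINGS, total, spread, and summarize are hypothetical; the scores are transcribed from Figure 8 in the order Table Leaders 1 to 3, then Chief Readers 1 and 2; and only a subset of the criteria is shown. The spread (highest rating minus lowest rating) is a simple measure of disagreement chosen for this sketch, not one used in the study.

```python
# Ratings transcribed from Figure 8 (4 = most important, 1 = least important).
# Order within each tuple: Table Leaders 1, 2, 3, then Chief Readers 1, 2.
RATINGS = {
    "Focus": (4, 3, 3, 4, 4),
    "Fluent Sentence Style": (4, 3, 4, 4, 4),
    "Conclusion": (4, 1, 1, 2, 3),
    "Avoid. Fragments, Run-ons": (4, 4, 2, 2, 4),
    "Avoid. Dialect Errors": (3, 4, 2, 2, 4),
    "Avoid. ESL Errors": (3, 4, 3, 2, 4),
}


def total(scores):
    """Sum of the five ratings, as in the Total column of Figure 8."""
    return sum(scores)


def spread(scores):
    """Illustrative disagreement measure: highest minus lowest rating."""
    return max(scores) - min(scores)


def summarize(ratings):
    """Return (criterion, total, spread) rows, widest disagreement first."""
    rows = [(name, total(s), spread(s)) for name, s in ratings.items()]
    return sorted(rows, key=lambda row: (-row[2], -row[1], row[0]))


if __name__ == "__main__":
    for name, points, gap in summarize(RATINGS):
        print(f"{name:<28} total={points:>2}  spread={gap}")
```

Under this measure the criteria named in the text (fragments and run-ons, dialect and ESL errors, and conclusion) show a wider spread than focus or fluent sentence style, which is consistent with the pattern described above.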

The  logs  corroborated  at  once  the  similarities  and 
differences  between  the  two  chief  readers.  Although  each 
one  responded  to  the  degree  of  development  in  papers  and 
to  sentence  constructions,  distinctive  responses  also 
appeared.  Whereas  the  first  chief  reader  referred  to 
problems in general terms, such as "multiple language
errors" or "mechanical problems," the second chief reader
was more likely to identify the exact nature of the
errors, using such labels as "run-ons," "verb endings," and
"apostrophes." Similarly, whereas the first chief reader
noted the content of essays more frequently than did the
associate  chief  reader,  the  second  chief  reader  commented 
more  often  than  the  first  on  the  nature  of  introductions  or 
conclusions.  At  the  same  time,  both  chief  readers  were 
concerned  about  focus,  and  both  characterized  essays 
globally:  Chief  Reader  1  talked  about  the  "thorough 
treatment  of  the  topic"  in  given  essays  or,  conversely, 
an  essay  that  was  "bland  and  superficial";  the  second 
chief reader commented on papers which reflected either
"competent writing" or "ho-hum writing."

The table leaders' logs showed their individuality as
well.   Table  Leader  1  expressed  concern  with  organization 




                              Table    Table    Table    Chief    Chief
                              Leader   Leader   Leader   Reader   Reader
Category                      No. 1    No. 2    No. 3    No. 1    No. 2    Total

Adequate Controlling Idea       4        4        4        4        4       20
Focus                           4        3        3        4        4       18
Depth of Thought                3        3        3        4        4       17
Development                     4        3        3        4        4       18
Organization                    4        4        2        3        4       17
Introduction                    4        2        2        2        3       13
Conclusion                      4        1        1        2        3       11
Commitment of Writer            3        4        3        3        2       15
Unity                           4        3        3        3        4       17
Tone                            4        3        3        3        3       16
Creativity                      1        2        3        3        2       11
Fluent Sentence Style           4        3        4        4        4       19
Variety of Sent. Struct.        3        3        4        3        4       17
Accurate Diction                4        3        3        4        4       18
Mature Diction                  4        2        3        4        4       17
Avoid. Fragments, Run-ons       4        4        2        2        4       16
Avoid. Tangled Sentences        4        3        3        3        4       17
Avoid. Usage Errors             3        2        2        2        4       13
Avoid. Dialect Errors           3        4        2        2        4       15
Avoid. ESL Errors               3        4        3        2        4       16
Spelling                        2        3        2        2        3       12
Punctuation                     3        3        3        2        3       14
Capitalization                  3        4        1        2        3       13
Length                          3        3        2        2        4       14

Write-in Category
  Logic, Reasoning

Note: 4 = Most important.
      1 = Least important.

Figure 8. Summary of points assigned to writing criteria by
chief readers and table leaders.




Most
Important

20*  Adequate Controlling Idea
19   Fluent Sentence Style
18   Focus/Development/Accurate Diction
17   Depth of Thought/Organization/Unity/Variety of Sentence
     Structure/Mature Diction/Avoidance of Tangled Sentence
     Structures
16   Tone/Avoidance of Fragments and Run-ons/Avoidance of ESL
     Errors

Least
Important

15   Commitment of Writer/Avoidance of Dialect Errors
14   Punctuation/Length
13   Introduction/Avoidance of Usage Errors/Capitalization
12   Spelling
11   Conclusion/Creativity

*Total points assigned by 5 chief readers and table leaders.
(The maximum number possible is 4 points per criterion per individual.)

Figure 9. Ratings by chief readers and table leaders of
importance of criteria in timed writing.




and vocabulary, with voice and style, with a writer's
sense of control, and with a writer's ability to make a
paper interesting. Table Leader 2 remarked on introductions
and conclusions, took note of pertinent details
and  logical  support,  and  identified  types  of  errors 
specifically.  This  table  leader  responded  especially 
to  repetition  and  diction.  Table  Leader  3,  like  Table 
Leader  2,  also  commented  on  repetition  and  on  the  presence 
of  errors;  in  addition,  he  called  attention  to  the  thesis 
and  focus  of  papers,  as  well  as  to  diction  and  logic. 

An  excerpt  from  the  logs  illustrates  the  table  leaders' 
shared  interest  in  organization,  as  well  as  their  different 
perspectives.  About  one  sample,  which  received  two  scores 
of  3  and  one  score  of  2+  from  the  table  leaders,  Table 
Leader 1 wrote the following comment: "organized — clear —
held my interest — movement — strong voice (vocabulary and
rhythm limitations make paper a 3, not a 4)." Table
Leader 3 noted about the same sample, "Can't spell his subject.
Organized, focused — but spelling!" Table Leader 2
commented, "Spelling problems; good details — obviously
knows his topic. Many good sentences — logically well
organized."

Thus,  like  the  readers,  the  two  chief  readers  and  three 
table  leaders  brought  their  individual  perspectives  to  the 
evaluation  of  each  essay.  At  the  same  time,  as  Figure  B-4 
in Appendix B indicates, the table leaders' scores on the
samples showed substantial agreement. Only on one out of
39  scores  did  the  table  leaders  disagree;  on  the  remaining 
papers  the  scores  were  either  identical  or  contiguous. 

Training  with  Rangefinders  and  Samples 

Before  the  monitored  scoring  began,  the  table  leaders 
reported  for  a  discussion  of  the  samples  and  the  actual 
scores  that  had  been  assigned  two  years  before.  This  short 
meeting  took  the  place  of  the  formal  table  leaders'  session 
which  is  customarily  held  the  day  before  any  holistic 
scoring  and  at  which  table  leaders  can  disagree  with  any 
samples  selected  by  the  chief  readers. 

The table leaders' monitoring tasks formally began with
the  discussion  that  ensued  after  the  rangefinders  were 
tallied  at  the  start  of  the  monitored  scoring;  on  this 
occasion,  most  readers  found  the  rangefinders  to  be  good 
indicators  of  each  scoring  level,  and  the  discussions  were 
brief as a result. However, Table Leader 1 did work with
Reader 1D, who inquired why paper T was not an upper-half
paper.

Eleven  other  samples  were  presented  at  intermittent 
intervals  throughout  the  scoring  to  prevent  the  readers 
from  drifting  away  from  the  standards.  Each  time  the 
scores  were  either  identical  or  contiguous  on  these 
samples,  and  no  discussions  occurred.  However,  Table 
Leader 1 recorded in the log that one reader had scored
sample FF as a 3 and then silently changed his score to a
2  when  no  other  readers  at  the  table  showed  a  similar 
reading  to  his.  Referring  to  the  score  given  by  dozens  of 
readers  in  the  original  scoring  two  years  previously,  the 
table  leader  concluded,  "Sample  FF  was,  in  fact,  a  3  paper. 
The  reader  had  succumbed  to  group  pressure  and  awarded  the 
group's  score,  not  his  score." 

This  incident  illustrated  the  importance  of  readers' 
maintaining  confidence  in  their  own  judgment  and  in  their 
own  ability  to  adhere  to  the  standards.  The  "procedures" 
section  of  some  readers'  logs,  as  well  as  the  tapes, 
partially  conveyed  the  extent  to  which  the  samples  and 
rangefinders  helped  align  readers  with  these  standards. 
(Information from the procedures is limited in that only
four  of  the  eight  readers  actually  completed  this  portion 
of  their  monitored  logs;  apparently,  they  did  not  all 
understand  the  importance  of  completing  this  section,  nor 
could  this  part  of  their  task  be  emphasized  as  readers  were 
not  to  know  the  nature  of  the  study.)  Nevertheless,  nearly 
all  the  essays  that  were  scored  immediately  following  the 
rangefinders received scores of either 2 or 3. That so
many of these early essays were middle-range papers is not
surprising; in fact, during the unmonitored scoring, Reader
3D, upon giving his first paper a score of 4, noted that it
was difficult to begin scoring with extremes. Notwithstanding
the difficulty of initially assigning scores at
either end of the range, two of the essays scored
immediately  after  the  rangefinders  received  scores  of  1. 
More  important,  in  all  cases  the  early  scores  proved  to  be 
accurate  in  that  when  the  same  essays  were  scored  later 
during  the  session  by  other  readers,  the  papers  received 
the  same  scores. 

The  impact  of  the  training  samples  could  be  seen  in 
the  scoring  patterns  that  occurred  immediately  before  or 
after  breaks.  At  one  point,  Reader  3D,  upon  returning 
from  a  break,  noticed  that  he  was  scoring  low,  and  he 
expressed  concern  as  to  whether  something  had  happened  to 
his  own  sense  of  the  standards.  Moreover,  Reader  3A 
speculated  that  grades  before  lunch  might  be  lower  than 
grades  after  lunch  when  readers  were  satisfied.  Indeed, 
other  readers  had  implied  similar  concerns  on  their 
questionnaires  when  they  noted  that  fatigue,  a  post-lunch 
slump,  or  even  room  temperature  could  affect  their  scoring 
processes  (see  item  46  of  Table  5).  However,  the  data  did 
not  support  Reader  3A's  conjecture:  The  scores  before 
lunch  were  always  comparable  to  those  given  at  other  times 
of  day  to  the  same  essay;  in  fact,  one  score  before  lunch — 
a  4 — was  higher  than  any  other  scores  the  same  essay 
received.  Scores  given  after  lunch  when  samples  were  again 
provided  were  also  representative  of  the  other  scores 
assigned  those  particular  papers  at  other  times  in  the  day. 
Readers  made  virtually  no  comments  about  the  samples 
in their logs or on their tapes: The only reference came
from Reader 3D, who noted that both a sample and an essay
he  read  immediately  after  the  sample  had  noticeably  large 
handwriting.  Thus,  the  inclusion  of  samples  seemed  an 
integral  part  of  the  holistic  training  procedures — almost 
taken  for  granted  by  the  readers  but,  presumably,  helping 
them to focus on the standards.

Monitoring  by  the  Table  Leaders 

Readers  were,  however,  frank  about  relying  on  the 
table leaders to confirm their judgments. Even during the
unmonitored  scoring,  one  reader  noted  that  if  it  had  been 
possible,  she  would  have  asked  the  table  leader  for  help 
with  certain  essays.  Similarly,  in  the  monitored  scoring, 
the  tapes  and  logs  revealed  several  instances  in  which 
readers  also  turned  to  the  table  leaders  for  advice:  For 
example,  Reader  3C  deliberately  sought  out  her  table  leader 
about  paper  073  which  she  had  found  difficult  to  score,  and 
Reader  3D  mentioned  on  tape  that  he  would  have  asked  the 
table  leader  about  a  paper  if  the  table  leader  (who  was 
moving  among  four  different  offices  during  the  taping  part 
of  the  monitored  scoring)  had  been  available. 

Several  log  notations  of  the  readers  at  each  table 
confirmed the readers' views of table leaders as helpful
resource  people.  Reader  1A  initiated  a  discussion  with  the 
table  leader  about  paper  006,  and,  as  both  their  logs 
indicated, a discussion ensued about organization, surface
errors, conciseness, and development. Late in the scoring
Reader  1C  inquired  whether  her  score  on  paper  103  was 
too  high;  the  table  leader  suggested  that  she  consider 
"language  and  structural  weaknesses"  in  determining  her 
score. 

At  table  2  readers  also  turned  to  their  table  leader: 
Reader  2A  pointed  out  a  curious  sentence  from  paper  045, 
and  Reader  2B  asked  the  table  leader  if  paper  051  could 
possibly  be  a  4.  Thus,  part  of  the  monitoring  was  clearly 
reader-initiated,  as  the  readers  sought  out  the  table 
leaders  for  brief  discussions  of  problematic  papers. 

More  commonly,  the  table  leaders  would  initiate  the 
interaction  after  they  reviewed  essays  selected  at  random 
from  each  reader's  set  of  scored  papers.  The  table  leaders 
would  read  the  selected  papers,  assign  them  an  independent 
score,  comment  on  the  essays  in  their  own  logs,  and  then 
look  at  the  particular  reader's  corresponding  score  and 
comment.  If,  as  Table  Leader  1  noted,  there  was  "easy, 
rapid  agreement"  on  certain  papers  that  were  classic 
representatives  of  the  score  levels,  no  conferences  were 
likely  to  occur.  Occasionally,  a  table  leader  would  confirm 
the  score  aloud  with  the  reader,  as  when  Table  Leader  2 
said of a paper given a final score of 2-, "I almost gave
it a 1," and Reader 2C exclaimed, "I almost gave it a 1
also!" Similarly, Table Leader 3 conferred with Reader 3A
about paper 010 on which they had given identical scores.
Table Leader 3 noted in his log, "Reviewed for 2-3
minutes — [we  agreed]  on  positive  aspects  of  the  paper. 
Reader  thinks  facts  are  also  a  problem — invented."  Thus, 
when  papers  were  particularly  noteworthy — either  for  being 
very  good  or  very  bad — discussions  occasionally  would  occur 
even  when  identical  scores  were  assigned.  More  often  than 
not,  however,  no  conferences  would  arise. 

Few  discussions  appeared  to  take  place  when  contiguous 
scores  were  considered  accurate.  For  example,  Table 
Leader  3  noted  about  paper  065,  to  which  he  had  given  a 
score  of  2  and  the  reader  a  score  of  3,  the  comment,  "No 
need to review — I think it's a 2+ or 3-." Similarly, Table
Leader 2 noted about paper 081, "Reader 2C scored this a 3
and I a 2+. This discrepancy does not seem major. The
reader offered to look at this paper again, but I did not
feel this to be necessary." Thus, although contiguous
scores may appear to be quite different — and indeed they
are at times — a score of 2 and another score of 3 or a
score of 4 and another of 3 may be an accurate assessment
of  a  paper.  The  score  ranges  are  broad,  and  what  one 
reader  may  perceive  as  a  high  1,  for  example,  another 
reader  may  see  as  a  low  2. 
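
The distinction among identical, contiguous, and discrepant scores can be made concrete with a short sketch. The Python below is illustrative only and is not drawn from the study: the function names are hypothetical, and it assumes that plus and minus modifiers (as in 2+ or 3-) are set aside when two scores are compared, so that contiguity depends only on the whole-number score level. That assumption simplifies the judgment the table leaders actually exercised.

```python
def base_level(score: str) -> int:
    """Return the whole-number level of a score such as '2+', '3-', or '4'."""
    return int(score.strip().rstrip("+-"))


def compare(score_a: str, score_b: str) -> str:
    """Classify two scores by the gap between their whole-number levels."""
    gap = abs(base_level(score_a) - base_level(score_b))
    if gap == 0:
        return "identical"
    if gap == 1:
        return "contiguous"
    return "discrepant"


if __name__ == "__main__":
    # Paper 065: the table leader gave a 2 and the reader a 3.
    print("paper 065:", compare("2", "3"))
    # Paper 081: the table leader gave a 2+ and the reader a 3.
    print("paper 081:", compare("2+", "3"))
    # A hypothetical pair two levels apart, for contrast.
    print("hypothetical:", compare("1", "3"))
```

Treating a high 1 and a low 2 as contiguous, as this sketch does, mirrors the observation above that the score ranges are broad.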

Discussions  did  ensue  when  table  leaders  thought  that 
scores  were  discrepant  or  when  the  chief  readers  returned 
papers  with  what  they  believed  to  be  inaccurate  scores. 
The log notation of Table Leader 1 was representative of
such an instance: "I discussed 002 with Reader 1C, a paper
from a check reading which she had overrewarded. Little
interaction after re-reading; she said she saw what I meant
about the paper being a 1, not a 2." The log of Table
Leader 2 revealed a similar occurrence:

     The Chief Reader brought paper 045 back from check
     reading: Reader 2A 1; TL 2, 2; Head Table 3 — oops!
     Reader 2A reread and decided 's-v problems,
     slippery syntax' — would change her score to a 2 if
     they want her to. But I think there are enough
     problems to keep it in lower half: sp, punc.,
     word endings, diction.

The change in scores that these returned papers
generated was not a frequent occurrence: The logs of all
three table leaders revealed only a few instances in which
the readers actually changed scores after rereading and
discussing questionable papers with the table leaders.
What happened more frequently instead was that papers with
contiguous or even identical scores served as a springboard
for brief conversations about writing. For example, Table
Leader 2 wrote about paper 054 to which she had assigned a
score of 3- and Reader 2C a score of 3: "[The reader]
liked the personal experience a bit more than I. Some
of our comments were the same; she felt that we were 'on
the same wave length.'" Similarly, Reader 1D asked his
table leader about the probable source of troublesome
sentences in paper 083; together they discussed whether
it might be due to the writer's limited vocabulary or to
logic problems. Still another instance occurred when Table
Leader 3 conferred with his reader about the strong
organization and competent style of one paper that overcame
its minor logic problems.

The  tapes  further  illustrated  the  ongoing  nature 
of  discussions  between  readers  and  table  leaders.  In 
reviewing  a  probable  score  of  4  or  3  for  paper  010,  Table 
Leader  3  recounted  to  Reader  3C  his  own  mental  debate  as  to 
whether  the  strong  development  offset  sufficiently  the 
errors  which  pulled  the  paper  down.  When  Reader  3C  raised 
additional  questions  about  the  content  of  the  paper,  the 
table  leader  concluded,  "Right.  So  if  you  want  to  give  it 
a  4,  that's  fine,  but  if  you  want  to  knock  it  down  because 
of  the  facts  and  the  errors  to  a  3,  that  would  be  fine. 
There's  no  doubt  it's  an  upper-half  paper." 

On  no  occasion  were  there  any  signs  that  the  table 
leaders  made  readers  change  their  scores  on  questionable 
papers. When Reader 3C asked the table leader whether she
should  change  her  score  of  1  on  paper  013  to  a  2,  Table 
Leader  3  laughingly  replied,  "No,  I  don't  plan  to  beat  you 
over  the  head  and  make  you  change  it  to  a  2.  It's  not 
warranted.  I  just  wanted  to  make  sure  we  were  talking  more 
or  less  about  along  the  same  lines." 

The  table  leaders  recognized  that  they  themselves  could 
be  in  error.  For  example,  Table  Leader  2,  writing  in  her 
log about paper 081, to which she had given a 2+, noted,
"Reader 2D gave this a 3. Since Reader 2C did the same
this morning, I guess I'm off on this one." The awareness
that table leaders and chief readers could be wrong,
too, was also shown when, as discussed previously, Table
Leader 2 disagreed with the chief reader's judgment of
a 3 for paper 045, concluding rather that the extent of
problems kept that paper in the lower half.

But if the table leaders and chief readers did
not perceive themselves as authority figures who were
necessarily "right," a brief comment by one or two readers
conveyed that some readers — on some occasions, at least —
showed a deference for the leaders' judgment. Not only did
such deference appear in Reader 2A's previously cited
willingness to change her score after a check reading from
a 1 to a 2 on paper 045 "if they wanted her to" [italics
added], but it also was clearly expressed by Reader 3B in
her taped response to paper 085:

     [The paper] begins to break down a little in the
     end, and yet it is well said. I'm not sure . . .
     definitely, at least, a 3. I'm not sure if the
     table leader would accept a 4. I'm going to stop
     just a minute and consult my 4 rangefinder. The
     paper is not as good as the 4 rangefinder. It has
     a strong introduction, though. I'm going to
     consult my table leader and see what he thinks.

When Table Leader 3 agreed that the sophistication of
thesis overrode what he called "errors of inelegancies" to
make the paper a 4, Reader 3B responded enthusiastically,
"Yeah, okay. Great! I feel very comfortable with that."

Thus, as can be seen, the interaction between table
leaders and readers could be broadly characterized as
pleasant and congenial. Tending to perceive table leaders
as  helpful,  readers  turned  to  them  for  guidance;  they 
showed  respect  for  the  table  leaders,  and  some  readers 
expressed  a  willingness  to  alter  their  scores  if  necessary, 
even  though  the  table  leaders  did  not  convey  a  need  for 
doing so. This respectful attitude is especially noteworthy
in view of the fact that many of the readers in
this  study  had  previously  served  as  table  leaders  several 
times.  Thus,  their  deference  may  derive  from  respect  for 
a  colleague's  judgment  rather  than  from  perceiving  someone 
in  an  authoritarian  position. 

Questionnaire  Responses  About  Monitoring 

Additional  insight  into  the  nature  of  monitoring 
was  provided  by  all  17  participants'  responses  to  one 
section  of  the  questionnaire.  As  indicated  in  Table  7, 
over  three-fourths  of  the  participants  believed  that 
regular  discussions  of  sample  papers  (item  48)  helped  to 
maintain  their  awareness  of  group  standards;  in  addition, 
10  respondents  stated  that  they  "almost  always"  or  "always" 
reexamined  rangefinders  or  operational  definitions  if  they 
needed to realign their scoring standards. Nearly two-thirds
felt that it was "almost always" helpful to consult
with table leaders or chief readers on problem papers, and
59%  indicated  that  they  frequently  felt  free  to  disagree 
with  table  leaders  or  chief  readers  if  they  considered  them 


TABLE 7
Questionnaire Results Dealing with Training

                                   Always/
Questionnaire                      Almost      Often/       Seldom/
Item No.                           Always      Sometimes    Never       Other

38) Easier to score essays in a    10 (59%)    3 (18%)      3 (18%)     1 (6%)
    structured setting than at                                          same
    home

39) Being "off" in scoring a       —           5 (29%)      12 (71%)
    sample shakes confidence

40) Helpful to consult with        11 (65%)    4 (24%)      2 (12%)
    table leaders on problem
    papers

41) Feel free to disagree with     10 (59%)    6 (35%)      1 (6%)
    table leaders/chief readers
    if wrong

42) A returned paper affects       1 (6%)      11 (65%)     5 (29%)
    subsequent scoring process

44) Knowing that essays are        —           2 (12%)      15 (88%)
    checked is troublesome

48) Sample papers help to keep     13 (76%)    3 (18%)      1 (6%)
    group standards in mind

49) Use the rangefinders or        10 (59%)    4 (24%)      3 (18%)
    definitions to realign
    standards

52) Discussions with readers,      16 (94%)    1 (6%)
    table leaders, and chief
    readers are collegial



wrong.  Thus,  the  questionnaire  responses  confirmed  the 
role  of  table  leaders  as  revealed  through  the  logs  and 
tapes — namely,  that  table  leaders  served  as  guides  or 
consultants,  rather  than  as  authority  figures.  This 
guiding  role  may  explain  why  59%  of  the  study  participants 
agreed  that  it  was  "always"  or  "almost  always"  easier  to 
score  papers  in  a  structured,  monitored  setting  than  at 
home.  Congeniality  was  also  important,  as  94%  agreed  that 
their  discussions  of  essays  with  other  readers  could 
"always"  or  "almost  always"  be  characterized  as  collegial. 
In  view  of  White's  emphasis  (1985)  on  the  importance  of  a 
supportive,  congenial  atmosphere,  this  finding  was 
important . 
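
The percentages reported for the 17 questionnaire respondents can be recomputed directly from the response counts. The short Python sketch below does so for three items whose figures are also cited in the text; the function name `percent_of` and the dictionary layout are illustrative assumptions, not part of the study's own procedures.

```python
def percent_of(count, total=17):
    """Return count as a whole-number percentage of total respondents."""
    return round(100 * count / total)

# Response counts for three Table 7 items, in the order:
# always/almost always, often/sometimes, seldom/never.
table7_counts = {
    39: (0, 5, 12),   # being "off" on a sample shakes confidence
    42: (1, 11, 5),   # a returned paper affects subsequent scoring
    44: (0, 2, 15),   # knowing that essays are checked is troublesome
}

# Convert every count to a percentage of the 17 participants.
table7_percents = {
    item: tuple(percent_of(count) for count in counts)
    for item, counts in table7_counts.items()
}
```

Each row of counts sums to 17, so a recomputed percentage that disagrees with the table would point to a transcription error rather than to missing responses.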

A  picture  of  self-confidence  emerged  from  the 
questionnaires,  with  71%  admitting  that  they  were  "seldom" 
or  "never"  shaken  by  being  incorrect  in  the  scoring  of  a 
sample  essay.  In  a  similar  vein,  88%  indicated  that  they 
were  "seldom"  or  "never"  bothered  by  knowing  that  their 
scores  were  being  checked.  Despite  the  apparent  self- 
confidence,  nearly  two-thirds  admitted  that  the  return 
of  a  paper  at  least  sometimes  affected  their  scoring  of 
the subsequent few papers. Their response suggests that
whenever  they  are  asked  to  reread  an  essay  because  of  an 
inaccurate  score,  they  may  become  particularly  attentive  to 
their  own  scoring  processes — at  least  for  the  immediate 
period  afterward. 


A  similar  portrait  of  self-confidence  emerged  from  the 
additional  questions  asked  of  scorers  who  had  frequently 
served  as  table  leaders  or  chief  readers  in  the  past.  (See 
Table  8.)  Two-thirds  of  the  12  respondents  indicated  that 
they  "seldom"  or  "never"  had  difficulty  in  identifying  the 
more  discrepant  essay  out  of  those  they  were  asked  to 
referee,  nor  did  they  find  it  difficult  to  deal  with 
readers  who  might  be  unwilling  to  adjust  scores  at  the 
table  leaders'  suggestions.  That  the  table  leaders 
acknowledged  the  possibility  of  readers'  being  right  was 
implied in the response of "often" or "sometimes" which
three-fourths of the scorers gave when asked whether
disagreement  with  a  reader's  score  could  cause  them  to 
reconsider  their  own  judgment. 

Those  12  responding  to  the  additional  section  of 
the  questionnaire  did  not  agree  that  part  of  the  table 
leaders'  role  was  that  of  arbitrating  standards:  Four 
table  leaders  said  it  "almost  always"  was,  four  said  it 
"sometimes"  was,  two  said  it  "never"  was,  and  two  did  not 
answer.  Whether  the  question  was  ambiguous  or  whether  the 
respondents  interpreted  their  role  in  connection  with  the 
standards  differently  is  not  clear.  What  they  did  agree  on 
substantially — with  83%  marking  "almost  always" — was  that 
monitoring  is  an  effective  means  for  helping  the  group  to 
adhere to group standards.


TABLE 8
Additional Questions for 12 Table Leaders and Chief Readers

Questionnaire                        Always/          Often/       Seldom/
Item No.                             Almost Always    Sometimes    Never       Other

55) Read problem papers              11 (92%)         1 (8%)
    holistically and analytically

56) Part of role is to               4 (33%)          4 (33%)      2 (17%)     2 (17%)
    arbitrate standards                                                        no resp.

57) Disagreement with reader's       1 (8%)           9 (75%)      2 (17%)
    score causes a reconsideration
    of own judgment

59) In refereeing papers, hard       --               4 (33%)      8 (66%)
    to identify the discrepant
    score

60) Difficult to deal with           1 (8%)           3 (25%)      8 (66%)
    uncooperative reader about
    altering score

61) Monitoring is effective in       10 (83%)         2 (17%)
    helping group to adhere to
    group standards

Their  perspectives  on  monitoring  more  fully  appeared  in 
the  open-ended  question  that  asked  all  participants  to 
comment  on  how  monitoring  affects  the  interplay  between  the 
reader  and  the  essay  in  an  holistic  scoring.  Repeatedly, 
the  comments  stressed  the  importance  of  table  leaders' 
dealing  courteously  with  readers,  the  importance  of,  as 
Reader  2D  noted,  reinforcing  positively  what  readers  are 
doing,  and  the  importance  of  minimizing  any  intrusions  into 
the  scorers'  reading  processes.  Most  significantly,  the 
comments  stressed  the  positive  value  of  monitoring  as  a 
beneficial  procedure. 

Several  of  those  writing  from  only  a  reader's 
perspective  acknowledged  some  anxiety  at  being  checked, 
but  as  Reader  2C  admitted,  this  discomfort  was  good, 
serving  to  keep  them  on  their  toes  when  they  were 
tired  or  when  their  minds  had  wandered.  Reader  2A 
emphasized  that  "monitoring  by  an  excellent  table  leader — 
and  almost  always  they  are — is  a  real  help  and  support." 
The  thoughtful  comments  by  Readers  1A  and  3B,  noted 
below,  stand  as  eloquent  testimony  to  the  benefits  of 
monitoring: 


Reader 1A  As a table leader, I have observed the monitoring
           process as a tempering of our individual
           prejudices and preconceived notions about how
           the papers should be graded. We must set aside
           our whims, caprices, and dogmatism in the
           interest of fairness and competency. Readers,
           table leaders, and chief readers balance papers
           against group standards adjusting skillfully as
           we proceed.

Reader 3B  As a reader, I find the structure useful,
           supportive, reassuring, and congenial. I feel
           in touch with the standards. I have resource
           people available to me when I have questions.
           I think the formal setting helps me deal with
           essays fairly. The monitoring process makes the
           effort a collegial attempt to establish and
           share certain standards and values among
           professional colleagues, and the students
           benefit ultimately from that.

Therefore, as can be seen from the logs, questionnaires,
and tapes, the participants in this study perceived
the monitoring process to be a positive source of guidance
and support. Rather than considering it as dogmatic or
authoritarian, they envisioned the monitoring as a resource
for scorers and a springboard for a discussion among
professionals of the elements of writing. Both explicitly
and implicitly the participants conveyed that scoring
students' essays accurately and fairly was their ultimate
goal.

The  summary  and  conclusions  of  the  study  are 
presented in Chapter 5.


CHAPTER  5 
SUMMARY  AND  CONCLUSIONS 


In  this  study  the  impact  of  monitoring  on  the  holistic 
scoring  of  essays  was  explored.  Although  previous  research 
on  this  topic  has  been  limited,  monitoring  is  central  to 
the  issues  of  the  validity  and  reliability  of  holistic 
scoring  as  a  writing  assessment  tool.  For  example,  Charney 
(1984)  argues  that  the  reliability  of  holistic  scoring 
derives  from  agreement  on  such  superficial  features  of 
writing  as  handwriting  or  spelling,  features  which  render 
holistic ratings (expected as they are to measure
"substantive skills") invalid. Charney suggests, furthermore,
that the very need for training to help holistic
scorers adhere to writing criteria which are both preselected
and imposed on them by others renders the
validity of the resulting holistic scores questionable. In
contrast, White (1985) envisions holistic scorers as
comprising an "assenting community" much like the "interpretive
community" depicted by the reader response theorist
Fish  (1980).  In  White's  view,  training  helps  holistic 
scorers  to  own  both  the  standards  and  the  process.  Thus, 
the  purpose  of  this  study  was  to  determine  how  training 
and  monitoring  influence  the  writing  judgments  holistic 
scorers  make. 


Procedures  Used 

Both quantitative and qualitative measures were used.
Eight high school, community college, and university
teachers, all of whom were highly experienced holistic
scorers, first rated at home more than 50 expository essays
written by college undergraduates. The essays were
selected by a stratified random sampling procedure from
essays written for a statewide assessment program two years
previously. In addition to recording their scores for each
essay (from a high of 4 to a low of 1), the readers recorded
their responses to the essays in written logs. An
additional four readers, comprising a team of "special
readers," scored a subset of the same essays and recorded
on audiotapes their responses to the essays.
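
The stratified random sampling mentioned above can be illustrated with a brief Python sketch. The study does not state which strata were used or how many essays were drawn from each, so the use of the original score levels as strata, the equal allocation per stratum, and the names `stratified_sample`, `pool`, and `per_stratum` are all assumptions made purely for illustration.

```python
import random
from collections import defaultdict

def stratified_sample(essays, per_stratum, seed=0):
    """Draw `per_stratum` essays at random from each score level.

    `essays` is a list of (essay_id, score_level) pairs. Using the score
    level as the stratum is an assumption, not the study's documented design.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for essay_id, level in essays:
        strata[level].append(essay_id)
    # Sample independently within each stratum, in score-level order.
    return {level: rng.sample(strata[level], per_stratum)
            for level in sorted(strata)}

# Hypothetical pool: 40 essay IDs spread evenly over score levels 1 to 4.
pool = [(f"essay-{n:03d}", (n % 4) + 1) for n in range(40)]
chosen = stratified_sample(pool, per_stratum=3)
```

Sampling within each stratum guarantees that every score level is represented, which simple random sampling from the whole pool would not ensure.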

A  month  later  all  12  readers  assembled  for  a  formal, 
monitored  holistic  scoring  in  which  2  chief  readers 
provided  the  training  typical  of  formal  holistic  scorings; 
furthermore,  3  table  leaders  each  monitored  the  ongoing 
scoring  processes  of  4  readers.  The  readers  rated  another 
set  of  similar  expository  essays,  matched  beforehand  with 
the  first  set  of  papers  on  the  basis  of  scores  originally 
assigned  during  the  actual  holistic  scoring  two  years 
previously.  Again  the  readers  recorded  their  responses  to 
the  essays  either  in  logs  or  on  tapes;  all  participants 
answered  a  questionnaire  designed  specifically  for  the 
study. 


Issues  Explored 
Six  sets  of  questions  were  addressed  in  the  study: 

1.  Do  the  mean  scores  for  the  essays  differ  when  the 
papers  are  evaluated  by  readers  working  in  a 
monitored  setting  from  when  the  papers  are  judged 
by  the  readers  working  independently? 

A mixed-model analysis of variance for nested factors
and repeated measures was used to answer the first
question. Statistically significant results (p < .00005)
were obtained when the mean scores of the 51 pairs of
matched essays were compared in the unmonitored (i.e.,
at-home) condition and the monitored holistic scoring
condition. The mean scores given to essays in the monitored
condition proved to be lower than those given to the
matched essays when the readers evaluated the first set of
papers at home. The qualitative data supported the
quantitative findings: The logs of the table leaders,
together with the check reading results, indicated that
readers who tended to drift high in the unmonitored scoring
were pulled back in line with the standards during the
monitored scoring.
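
The direction of this difference can be illustrated with a far simpler calculation than the mixed-model analysis of variance used in the study. The Python sketch below computes the mean difference and a paired t statistic for matched essay means; it is not the study's analysis, and the scores, the function name `paired_difference`, and the variable names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_difference(unmonitored, monitored):
    """Return (mean difference, paired t statistic) for matched essay means.

    Each position holds the mean score of one matched pair of essays:
    the first list under unmonitored scoring, the second under monitored.
    """
    if len(unmonitored) != len(monitored):
        raise ValueError("the two lists must have the same length")
    diffs = [u - m for u, m in zip(unmonitored, monitored)]
    mean_diff = mean(diffs)
    standard_error = stdev(diffs) / sqrt(len(diffs))
    return mean_diff, mean_diff / standard_error

# Hypothetical mean scores on the 4-point scale for five matched pairs.
unmonitored_means = [3.1, 2.6, 2.9, 1.8, 3.4]
monitored_means = [2.9, 2.5, 2.5, 1.6, 3.3]

mean_diff, t_stat = paired_difference(unmonitored_means, monitored_means)
```

A positive mean difference corresponds to the pattern reported here, in which essays scored at home received higher means than their matched counterparts in the monitored setting. Unlike the mixed-model analysis, this sketch ignores differences among readers.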

2.  Do  experienced  readers  participating  in  a  monitored 
scoring  achieve  greater  agreement  with  each  other 
than  when  they  evaluate  essays  independently? 

An interrater reliability coefficient of over .91 (that
is, .936 in the unmonitored condition and .915 in the
monitored condition) was obtained with Cronbach's alpha in
both scoring conditions. Basically, then, the eight
readers keeping the logs agreed on the scores they assigned
the essays. (As noted previously, the four special readers
taping their responses to a subset of the essays were
not included in the statistical procedures.) However,
additional analysis revealed that a potential for discrepant
scores among certain readers, had two readers been paired,
existed on twice as many essays in the unmonitored condition
as in the monitored condition. Thus, monitoring
appeared effective in increasing agreement among the
readers.
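
Both measures of agreement discussed above can be sketched in a few lines of Python. The first function computes Cronbach's alpha with the readers treated as the "items"; the second flags an essay on which some pairing of readers would have produced a discrepant result. The scores are hypothetical, and defining "discrepant" as a gap of two or more points is an assumption, since the study's own threshold is not restated in this section.

```python
from itertools import combinations
from statistics import pvariance

def cronbach_alpha(scores):
    """Cronbach's alpha for a list of essays, each a list of reader scores."""
    k = len(scores[0])                      # number of readers
    by_reader = list(zip(*scores))          # one tuple of scores per reader
    reader_variances = sum(pvariance(r) for r in by_reader)
    total_variance = pvariance([sum(essay) for essay in scores])
    return (k / (k - 1)) * (1 - reader_variances / total_variance)

def potentially_discrepant(essay_scores, gap=2):
    """True if any two readers' scores differ by at least `gap` points.

    The default gap of 2 is an assumed definition of a discrepant pair.
    """
    return any(abs(a - b) >= gap for a, b in combinations(essay_scores, 2))

# Hypothetical scores: rows are essays, columns are three readers.
example_scores = [
    [4, 4, 3],
    [3, 3, 3],
    [2, 3, 2],
    [1, 1, 2],
    [2, 2, 4],
]
alpha = cronbach_alpha(example_scores)
flagged = [potentially_discrepant(row) for row in example_scores]
```

The two measures answer different questions: alpha summarizes consistency across the whole group of readers, whereas the pairwise check shows how often an operational scoring, in which only two readers see each essay, could have required a referee.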

3.  What  impact  do  the  chief  readers  have  on  an  holistic 
scoring?  How  do  they  ensure  both  a  reliable  and  a 
collegial  reading? 

A second Cronbach's alpha was run with the chief
readers' scores included. As might be expected, the
results of the second alpha did not differ substantially
from the results of the first alpha. That is, a
coefficient of .9474 was obtained in the unmonitored
condition and .9358 in the monitored condition. Additional
analysis showed that readers in the monitored condition
more closely approximated the chief readers' scores than
they did in the unmonitored scoring. If chief readers'
scores can be considered "true" scores because of chief
readers' experience with, and commitment to, the standards,
then the monitoring appeared effective in helping readers
score more accurately.
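
One simple way to express how closely a reader approximates the chief readers' scores is the mean absolute difference between the two sets of scores. The Python sketch below uses that measure on hypothetical data; the study does not specify the computation behind its "additional analysis," so the measure, the function name `mean_absolute_difference`, and all scores shown are assumptions.

```python
from statistics import mean

def mean_absolute_difference(reader_scores, chief_scores):
    """Average absolute gap between a reader's scores and the chief reader's."""
    if len(reader_scores) != len(chief_scores):
        raise ValueError("score lists must cover the same essays")
    return mean(abs(r - c) for r, c in zip(reader_scores, chief_scores))

# Hypothetical scores for six essays in each condition. The two conditions
# used different (matched) essay sets, so each has its own chief scores.
chief_unmonitored_set = [3, 2, 4, 1, 3, 2]
reader_unmonitored = [4, 3, 4, 2, 3, 3]
chief_monitored_set = [3, 2, 3, 1, 4, 2]
reader_monitored = [3, 2, 3, 2, 4, 2]

gap_unmonitored = mean_absolute_difference(reader_unmonitored, chief_unmonitored_set)
gap_monitored = mean_absolute_difference(reader_monitored, chief_monitored_set)
```

A smaller gap in the monitored condition would correspond to the finding that monitored readers more closely approximated the chief readers' scores.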


During  the  monitored  scoring,  which  the  researcher 
observed,  the  chief  reader  drew  the  readers  in  line  with 
the  standards  not  only  through  the  brief  comments  he  made 
about  the  sample  essays  but  also  in  the  very  samples  he 
selected  for  reviewing.  Although  the  public  tallying  of 
scores  on  rangefinders  and  other  training  papers  obviously 
entailed  some  peer  pressure,  no  criticisms  were  ever  made. 
Rather,  readers  whose  scores  appeared  to  be  discrepant  were 
asked  in  a  general  manner  to  look  a  particular  paper  over 
or  to  reconsider  their  scores.  Through  these  means  the 
professionalism  of  the  readers  was  acknowledged. 

4.  What  criteria  do  readers  use  in  assigning  different 
score  levels?  What  standards  are  reflected  in  the  score 
levels  assigned  across  essays?  How  do  readers  respond  to 
these  standards? 

The  logs  and  taped  protocols  demonstrated  the  criteria 
that  readers  used  in  assigning  different  scores  to  papers. 
Most  readers  considered  4-level  papers  as  strong  or 
distinctive  essays,  reflecting  a  depth  of  ideas,  solid 
development,  good  organization,  and  coherence.  Because  of 
the strengths they associated with 4-papers, readers tended
to  assign  that  score  sparingly. 

Readers  also  made  positive  comments  about  3-level 
papers,  although  their  responses  to  essays  at  this  level 
included  some  criticism  of  problems  in  either  rhetorical  or 
mechanical  areas.  More  negative  than  positive  comments 
appeared about essays given scores of 2; the comments
reflected particular concern about sentence structure,
mechanics, and usage. Several readers commented on the
shallowness and mechanical quality of many 2 papers.

Unlike  some  work  (Haswell,  1988)  in  which  readers' 
judgments  of  1-level  papers  appeared  overly  restricted  and 
simplistic,  readers  of  this  study  responded  to  a  variety 
of  problems  in  1  papers,  occasionally  even  singling  out 
some  good  quality,  such  as  concrete  details.  As  might  be 
expected,  however,  the  responses  to  1  papers  were  generally 
negative. 

In  some  cases,  the  essays  themselves  defied  ready 
categorization  by  score  level.  Whereas  some  essays 
appeared to be classic 2's or clear-cut 3's, other papers
contained  qualities  of  more  than  one  level;  hence,  in 
evaluating  such  papers,  readers  were  required  to  balance 
strengths  against  weaknesses  in  an  effort  to  determine 
which  qualities  prevailed.  The  special  readers'  taped 
protocols  revealed  the  difficult,  ongoing  process  of 
adjustment  entailed  in  some  scoring  decisions.  Even  in  the 
logs,  a  few  readers  voluntarily  added  pluses  or  minuses  or 
arrows  to  convey  the  direction  of  a  particular  score  and 
their  own  difficulty  in  making  that  determination. 

On the issue of ownership of standards, questionnaire
results revealed that the 17 participants in the study (the
12 readers, 3 table leaders, and 2 chief readers) agreed
conceptually with holistic scoring and, with the exception
of two participants, endorsed the standards used in the
statewide assessment procedure. Even the two who responded
that these standards were too low stated that they could
work comfortably with the existing standards. Virtually
all participants believed they had little, if any, trouble
adhering to the standards, and most emphasized that they
were able to distinguish between their task of assigning
a score and the consequences the scores would have for
the students.

5.  Do any common patterns appear in the scorers'
written or audiotaped responses to the essays, or do their
comments underscore the individuality of each reader's
transaction with the text? Do readers' holistic judgments,
as shown by their written or verbal responses, correspond to
the writing features they rate as important on a questionnaire
devised for this study?

Another finding of the mixed-model ANOVA was that
statistically significant differences (p < .001) existed
among the readers. The logs and tapes underscored this
individuality.  For  example,  whereas  two  or  three  readers 
often  responded  to  introductions  and  conclusions,  other 
readers  seemed  especially  attuned  to  matters  of  diction  and 
content.  Similarly,  whereas  a  few  readers  tended  to 
identify  each  error  by  naming  it  specifically,  others  used 
such  umbrella  terms  as  "language  skills"  or  "mechanical 
problems"  to  label  problems. 

The individuality of the participants was again
apparent in their self-reporting of biases and preferences
on one section of the questionnaire. Although virtually
all responded positively to creativity, humor, and evidence
of a delightful writer, only some scorers liked rhetorical
devices or technical/scientific papers, whereas others
clearly did not. Similarly, whereas some viewed first-person
narratives positively, others disliked this
approach. The respondents to the questionnaire identified
a number of different strategies for dealing with papers
that triggered their personal biases and preferences; the
methods included slowing their reading down, occasionally
rereading an essay, reexamining the operational definitions,
or consulting with the table leader.

Still  other  evidence  of  the  individuality  of  the  12 
readers  occurred  in  the  varying  degree  to  which  some 
acknowledged  or  interacted  with  the  writer  behind  the 
paper;  it  also  occurred  in  the  varying  degree  to  which  some 
readers — especially  those  doing  the  talking  protocols — 
speculated  as  to  the  causes  of  certain  errors  and  their 
possible  remedies. 

Despite the evidence of individuality, readers clearly
shared certain beliefs, especially regarding the importance
of such elements as development, focus, and sentence
structure. In this respect the readers resembled those
in Sweedler-Brown's study (1985).

The writing criteria which most participants marked
on their questionnaires as being "very important" or
"important" often appeared in the logs and on the tapes,
thereby corroborating the significance of these features to
the readers. For example, four readers who admitted to
valuing word choice highly responded frequently to the
diction they saw in the essays; similarly, another reader
who rated content as very significant also commented often
in his logs about the quality of the ideas he saw in
essays. However, discrepancies appeared occasionally as
well, in that criteria which some readers rated as being
only "somewhat important" or "not very important" (criteria
such as introductions or conclusions) appeared frequently
in those readers' comments in the logs and on tapes. This
discrepancy could possibly be attributed to such factors
as the ambiguity of the terms "somewhat" and "not very" on
the questionnaire or the momentum of the scoring task
itself, which might have prevented readers from expanding
on their responses.

6.  What is the nature of the monitoring that the
readers  receive  as  reported  through  the  logs  of  table 
leaders  and  readers?  Do  the  procedures  noted  in  these 
logs,  together  with  the  protocols  of  the  special  readers, 
support  the  readers'  perceptions  of  their  own  holistic 
scoring  processes  as  noted  on  the  questionnaire? 

The monitoring did not appear dictatorial at any level.
In fact, readers indicated on their questionnaires that
they generally found their table leaders' monitoring to be
helpful, especially when it was done with sensitivity.
This perception of helpfulness was confirmed by those tapes
and logs which showed some readers either consulting the
table leaders about problematic papers or discussing larger
writing principles. The table leaders never insisted their
scores were right; rather, they tended to discuss the
qualities in the papers on which they had based their
scores. In fact, several readers indicated on their
questionnaires that they felt free to disagree with the
table leaders. Many readers perceived the monitoring as a
resource: For example, not only did one reader comment
during the unmonitored scoring that she would have turned
to a table leader if she could have, but another reader
also wrote on her questionnaire that "the monitoring
process makes the effort a collegial attempt to establish
and share certain standards and values among professional
colleagues, and the students benefit ultimately from that."

Discussion 

As  revealed  through  this  study,  monitoring  comprises 
both  a  source  of  guidance  and  a  springboard  for  discussions 
about  writing  principles.  Hence,  far  from  rendering  this 
evaluation  approach  invalid,  the  recalibration  in  holistic 
scoring  indeed  appears  to  serve  as  a  re-creation  of  what 
White  (1985)  calls  an  "assenting  community"  similar  to 
the  "interpretive  community"  discussed  by  the  reader 
response  theorist  Fish  (1980).  In  the  case  of  a  scoring, 
the community is comprised of chief readers, table leaders,
and the readers themselves; through their individual
interactions with table leaders and their group tallying of
samples and rangefinders, the participants in a scoring
negotiate  their  individual  responses  to  student  texts  in 
accordance  with  a  framework  of  standards  they  not  only 
recognize  but  also,  and  more  importantly,  adopt  as  their 
own. 

Of  course,  the  readers  in  the  unmonitored  condition 
of  this  study  also  comprised  an  "assenting  community"  to 
the  extent  that  they  had  internalized  the  standards  as 
their  agreement  with  one  another  on  many  essays  indicates. 
However,  what  the  unmonitored  condition  does  not  provide 
for  is  the  opportunity  for  readers  to  discuss,  to  share, 
to  debate,  to  determine  in  the  words  of  Fish  (1980)  "the 
interpretive  strategies"  (p.  171)  they  will  use  in 
responding to the texts.

Obviously,  reader  response  theory  cannot  be  applied  too 
extensively  to  the  assessment  context,  in  which  the  very 
purpose  for  reading — that  is,  the  evaluation  of  student 
essays — differs  from  the  purposes  involved  in  reading 
literary  or  informational  material.  Student  writers  of 
assessment  essays  are  not  likely  to  be  consciously  helping 
to  mold  the  interpretive  community  in  the  manner  Fish 
implies  that  authors  do;  neither  are  student  assessment 
essays  truly  reflective  of  the  informational  texts  on  which 
efferent transactions are based and which, according to
Rosenblatt, require some consensus among readers. Nevertheless,
the stance holistic readers must adopt falls
toward  the  "predominantly  efferent"  or  public  end  of  the 
continuum  as  readers  attempt  to  reconstruct  meaning  through 
what  Rosenblatt  refers  to  as  the  extracting  and  ordering 
of  the  ideas  to  be  retained  and  used  afterward  (Rosenblatt, 
1988,  p.  5).  In  this  context,  the  purpose  is  the  assigning 
of  a  score  to  each  student  essay.  The  readers  become 
engaged  to  varying  degrees  with  these  student  texts  and, 
as  one  special  reader  noted,  some  become  more  humane  in 
the  process. 

Notwithstanding  the  necessarily  limited  application 
of  reader  response  theory  to  holistic  scoring,  this  study 
certainly  supports  the  reliability  of  holistic  scoring  as 
a  means  of  writing  evaluation;  it  supports  the  validity  of 
holistic  scoring  as  well.  That  is,  the  tapes  and  logs, 
while  admittedly  underscoring  the  readers'  individuality, 
also  illustrate  that  readers'  responses  were  clearly 
affected  during  both  scoring  conditions  by  substantive 
criteria. Moreover, the monitoring that entailed discussions
of writing, the readers' stated willingness to
disagree upon occasion with the table leaders or chief
readers, the readers' universal perception of group
congeniality in the process, and the use of operational
definitions that emanated inductively from actual
essays support the commonality of the criteria and the
criteria-selection process used. Together, these convey
the image of holistic scoring as a vital, interpretive
enterprise in which readers attempt to apply standards
to actual essays and eagerly seek guidance in troublesome
cases.

Recommendations  for  Research 

Certainly, the limitations of the study cannot be
overlooked:  Not  only  was  the  study  limited  to  a  select 
group  of  highly  experienced  scorers  evaluating  a  small 
percentage  of  expository  essays,  but  also  these  scorers 
had,  to  quote  from  Diederich  et  al.  (1961),  bought  into 
"the  party  line"  (p.  10)  in  that  they  endorsed  the 
standards,  as  well  as  the  monitoring  procedures.  The 
methods  used  for  the  study  may  have  further  contributed 
to  the  limitations.  That  is,  the  practice  of  recording 
comments  either  on  audiotape  or  in  logs — a  practice  new  to 
these  participants — may  have  made  them  more  conscious  of 
individual  writing  traits  than  is  customary  in  most 
holistic  scorings.  In  fact,  Reader  3A  stated  outright  that 
taping  his  responses  orally  was  different  from  scoring  the 
essays  silently  in  that  he  had  to  find  labels  for  the 
various  and  often  complex  problems  he  perceived  in  some 
papers;  similarly,  Reader  1C  commented  aloud  that 
identifying  the  strengths  or  weaknesses  in  particular 
essays  for  the  purpose  of  maintaining  a  log  was  sometimes 
hard  to  do. 


Still other limitations may arise from the subjectivity
involved in the self-reporting required of all participants
by the questionnaire. Subjectivity was also entailed in
the interpretations the researcher needed to make in
categorizing the readers' written responses in a database
system. To be sure, the impact of this subjectivity was
checked by the random validation of 20% of these logs by
an outside expert. Nevertheless, this research needs to be
replicated with monitored and unmonitored scorings
in which no new methodologies are introduced.
Such a study should include a less experienced group of
holistic readers, in that scorers generally bring a range
of scoring experience to actual assessments, and the study
would therefore reflect more typical scorings. Moreover,
additional studies could include a broader scoring scale
(e.g., a scale of six or eight points) to determine if the
additional scoring levels make the task of balancing
strengths and weaknesses in each paper easier.

The  study  raises  a  number  of  other  writing  issues 
requiring  further  research.  As  noted  in  the  literature 
review,  Barritt,  Stock,  and  Clark  (1986)  found  the  readers' 
perceptions  of  the  freshman  writers  behind  the  essay  to 
have  an  impact  on  their  scoring  judgments  of  placement 
papers. The findings of this study also suggest that
readers' awareness of the writer behind the essay,
especially in terms of either the writer's voice or the
writer's ideas, can affect, to varying degrees, some
evaluations of competency essays. As this finding was a
corollary of other emphases in the study, this variable
should be explored more fully under controlled conditions.
A  second  issue  is  the  criteria  that  scorers  use  in 
evaluating  writing.  The  readers  in  this  study  clearly 
demonstrated  that  they  often  based  their  judgments  on 
substantive  elements  of  writing,  just  as  Diederich  et  al. 
(1961),  Freedman  (1979),  Breland  and  Jones  (1984),  and 
Huot  (1988)  concluded  from  their  studies.  This  finding 
contrasted  with  the  importance  that  mechanics  played  in 
studies  by  Harris  (1977),  Rafoth  and  Rubin  (1984),  and 
Stach  (1987).  At  the  same  time,  readers  in  this  study  did 
not  record  in  their  logs  all  the  features  they  rated  as 
being  "important"  or  "very  important"  to  them  on  their 
questionnaires;  conversely,  some  features  that  several 
rated  as  being  only  "somewhat  important"  did  appear  in 
their logs and on their tapes. Because of these discrepancies,
factors involved in the judging of writing continue
to be a research issue.

In  addition,  the  potential  benefits  and  drawbacks  of 
the  taped  protocols  used  in  this  study  should  be  more 
closely  examined.  The  special  readers'  taped  responses  to 
the  essays  appear  to  reflect  what  Elbow  (1986)  endorses  as 
an evaluative technique, namely, providing writers with the
"movies of the mind of the observer" (p. 181). Yet, at the
same time, as Freedman and Calfee (1983) and Martin (1987)
note,  problems  often  exist  in  articulating  such  processes 
as  the  evaluation  of  writing.  As  was  noted  previously  in 
this  study,  whereas  Reader  3D  found  that  taping  his 
comments  made  him  a  more  humane  reader,  Reader  3A  found  it 
easier  to  verbalize  on  his  tapes  the  definable  elements  of 
writing  rather  than  those  features  less  easily  labeled. 

Additional  research  is  also  needed  to  explore  parallels 
between the taping of writing evaluations and the conferencing
techniques that are practiced in writing centers and
classrooms.  For  example,  just  as  the  taped  protocols  of 
this  study  showed  the  special  readers  speculating  as  to  the 
causes  and  probable  remedies  for  errors,  so,  too,  have 
conferences  been  advocated  (Kroll  &  Schafer,  1978)  as  a 
means  of  diagnosing  students'  individual  problems.  More 
significantly,  as  both  the  taped  protocols  of  writing 
evaluations  and  conferences  can  be  used  to  dramatize  for 
students  a  reader's  ongoing  response  to  a  composition, 
research  into  prospective  links  between  these  two 
approaches  may  help  to  bridge  the  gap  between  writing 
assessment  and  writing  instruction. 

Conclusion 

Holistic scoring is not an exact or quantifiable
science, nor can any comprehensive evaluation of writing
be truly precise. Some subjectivity is usually involved.
However, because of this subjectivity, monitoring, when
sensitively done, increases reliability by encouraging
readers  to  sublimate  their  own  criteria  in  the  larger 
interest  of  commonly  adopted  standards.  The  holistic 
scoring  approach  as  shown  in  this  study  conveys  a  mutuality 
that  encompasses  the  individuality  of  each  reader's 
perspective  while,  at  the  same  time,  endorsing  group 
criteria  in  the  interest  of  fairness  toward  the  students. 
Monitoring,  as  an  integral  part  of  this  process,  helps  to 
maintain  a  unified  community  of  readers  who  willingly  seek 
to  respond  at  the  highest  reading  level,  the  evaluative,  to 
the  whole  of  each  writer's  text. 


APPENDIX  A 
FORMS  AND  QUESTIONNAIRE 


W.  Wolcott 


QUESTIONNAIRE  ON  HOLISTIC  SCORING 

The  following  questionnaire  contains  several  parts.   Please  answer  each  question  as  honestly  as 
possible,  and  add  any  comments  you  wish.   All  responses  will  remain  confidential. 


Scorer  number 


How long have you taught composition/English?


How  long  have  you  been  doing  holistic  scoring? 

Have  you  holistically  scored  exams  other  than  CLAST? 

What is your most common role during an holistic scoring for CLAST (e.g., table leader,
reader)?

PART  I: 

Please  check  the  importance  of  the  following  criteria  to  you  in  judging  any  TIMED  writing: 

(4) Very Important   (3) Important   (2) Somewhat Important   (1) Not Very Important

1.  Adequate  controlling  idea                      

2. Focus throughout the paper

3.  Depth  of  thought                              

4.  Development  of  ideas                            

5.  Organization  of  ideas                         

6. Adequate introduction

7.  Adequate  conclusion                           

8.  Commitment  of  writer  to  the  topic              

9.  Unity  and  coherence                           


10.  Appropriateness  of  tone 

11.  Creativity 

12.  Fluency  of  sentence  style 

13.  Variety  of  sentence  structure 

14.  Accuracy  of  diction 

15.  Maturity  of  diction 

16.  Avoidance  of  such  sentence  errors  as 
fragments  and  run-on  sentences 




17.  Avoidance  of  tangled  sentences 

18.  Avoidance  of  usage  errors 

19.  Avoidance  of  dialect  errors 

20.  Avoidance  of  ESL  errors 




21.  Accuracy  of  spelling 

22.  Accuracy  of  punctuation 

23.  Accuracy  of  basic  capitalization 

24.  Adequate  length 

Other  


How  does  your  evaluation  of  essays  that  are  timed  differ  from  your  evaluation  of  papers  that  are 
written  outside? 


25. All of us bring some personal biases and/or preferences to our readings. If you find yourself
reacting strongly in a positive way to any of the following items, please place a plus (+) in the
line next to the item. If you react negatively to any of the items, please place a minus (-) in the
line next to the item.

a)  Political  papers 

b)  Controversial  social  issues 

c)  Religious  papers 

d)  First  person  narratives 

e)  Technical/scientific  papers 

f)  Papers  with  literary  allusions 

g)  Unusually  creative  papers 
h)  Misinformation  in  papers 
i) Humor

j) Severe misspellings
k) Shallow papers

l) Rhetorical devices, such as questions

m)  A  disagreeable  writer  behind  the  paper 

n)  Hard-to-read  handwriting 

o)  A  delightful  writer  behind  the  paper 

p)  Extremely  short  papers 

q)  A  weak  conclusion  or  introduction 

r)  Inductively  written  papers 

s)  Sentimental  papers 

t)  Slang 

u) Others




When  you  encounter  papers  that  trigger  a  strong  personal  response,  how  do  you  handle  them? 

(6) Always   (5) Almost Always   (4) Often   (3) Sometimes   (2) Seldom   (1) Never

26.  Slow  down  your  reading 

27.  Re-read  the  paper 

28.  Consult  with  the  t.l.  or  c.r. 

29.  Score  the  same  as  other  papers 

30.  Re-examine  the  rangefinders 
Other  approaches  


PART III:

Please  check  the  appropriate  category  below  which  best  characterizes  your  attitudes  toward  writing 
and  the  holistic  scoring  process.  Although  the  most  recent  scoring  should  influence  your  responses, 
the  questions  are  concerned  with  your  actions  and  attitudes  in  general  to  holistic  scoring. 

(6) Always   (5) Almost Always   (4) Often   (3) Sometimes   (2) Seldom   (1) Never

31.  In  view  of  the  process-product  dichotomy 
in  composition,  do  you  endorse  the 
evaluation  of  written  products?  


32.  Do  you  use  timed  writings  in  your 
classes? 

33.  Do  you  use  holistic  scoring  at  some  point 
to  evaluate  your  classroom  papers? 


34.  Do  you  endorse  the  concept  of  scoring 
papers  as  a  whole? 


35.  Are  the  group  standards  for  CLAST 
compatible  with  those  you  might  use 
in  your  own  classes? 
If  not,  how  do  they  differ? 


36.  Do  you  have  difficulty  adhering  to  group 
standards  in  an  holistic  scoring? 


37. Do you have expectations of what a typical
paper  from  a  CLAST  student  should  be  like? 


38.  Is  it  easier  to  score  essays  holistically 
in  a  structured  setting  than  at  home? 

39.  Does  being  "off"  in  the  scoring  of  a 
sample  essay  shake  your  confidence? 




40.  Do  you  find  it  helpful  to  consult  with  the 
table  leader  (or  chief  readers,  if  you  are 
functioning  as  a  table  leader)  on  problem 
papers? 

41.  Do  you  feel  free  to  disagree  with  table 
leaders/chief  readers  if  you  consider 
them  wrong? 


42. When a paper is returned to you for
re-reading, does that procedure affect your
subsequent scoring of the next papers?


43.  Are  you  able  to  separate  your  task — e.g., 
the  assignment  of  a  particular  score — from 
the  consequences  of  that  score? 


44. Does knowing that your essay responses are
checked  bother  you  in  any  way? 


45. Do you feel pressured by the speed of other
readers  at  your  table  to  read  more  essays? 


46. Does your physical comfort, in terms of room
temperature, time of day, or seating
arrangements, affect your scoring in any way?
If  so,  please  explain: 


47. Do you feel comfortable with the standards
that  have  been  adopted  for  CLAST? 


48.  Does  discussing  sample  papers  at  regular 
intervals  help  to  keep  you  aware  of  group 
standards? 


49.  Do  you  re-examine  either  the  rangefinders  or 
the  operational  definitions  if  you  need  to 
realign  your  scoring  standards? 


50. When you encounter a problem paper, do you
find yourself examining it almost analytically?


51.  Does  your  perception  of  the  writer  behind 
the  paper  affect  your  scoring? 


52. Can your discussions on essays with other
readers, table leaders, or chief readers be
characterized as collegial?


From  past  scorings,  how  would  you  characterize  your  scoring  tendencies?  Are  you  generous,  strict, 
or  fair?   High  or  low?   Please  comment  briefly: 




PART  IV: 

The  following  two  questions  are  open-ended.   Please  respond  as  frankly  as  you  can. 

53.  How  do  you  arrive  at  your  holistic  score?  Do  you  make  an  immediate  judgment  as  to  score  level, 
or  do  you  read  top-down,  lowering  the  score  as  you  encounter  problems?  Or  do  you  first  decide  on 
the  half  that  the  paper  falls  into  before  you  decide  on  the  actual  score? 


54.  Please  comment  briefly  from  the  perspective  you  know  best  (e.g.,  reader,  table  leader,  chief 
reader)  on  how  the  monitoring  in  a  formal  holistic  scoring  appears  to  affect  the  interaction  between 
the  reader,  the  essay,  and  the  holistic  scoring  process. 


PART  V: 

If  you  have  often  served  in  the  capacity  of  chief  reader  or  table  leader,  would  you  please  answer 

the  following  questions: 

(6) Always   (5) Almost Always   (4) Often   (3) Sometimes   (2) Seldom   (1) Never

55.  When  you  are  given  problem  papers  to 
review,  do  you  read  them  holistically 
AND  analytically?  


56.  Do  you  see  part  of  your  role  as  that 
of  arbitrating  standards? 


57. Does disagreement with a reader's
score (or table leader's score) ever
cause you to reconsider your own
judgment?


58. When refereeing papers, can you
separate  your  task  of  assigning 
a  score  from  the  consequences 
that  your  score  will  have  on 
a  student? 


59. In refereeing papers with noncontiguous
scores (as opposed
to  2/1  papers),  do  you  find  it 
hard  to  identify  which  score  is 
the  more  discrepant? 


60.  Do  you  find  it  difficult  to  deal 
with  a  reader  who  is  unwilling  to 
alter  the  score  at  your  suggestion? 


61.  Do  you  believe  that  monitoring  is 
effective  in  helping  the  group  to 
adhere  to  group  standards? 




PILOT   TEST    OF   QUESTIONNAIRE 


Please  try  to  complete  the  following  questionnaire,  and 
respond  to  the  questions  below: 


1.   Are  the  directions  for  each  section  clear? 


2.   Is  the  wording  of  the  individual  items  clear? 


3.   Should  any  portions  or  items  on  the  questionnaire  be  taken 
out?   If  so,  please  indicate  below  which  items  and  why. 


4.   Have any important issues in holistic scoring been omitted
from  the  questionnaire?  Please  indicate  below  items  you 
think  should  have  been  included. 


5.   Does this questionnaire seem fair and comprehensive to
you?  Would  you  be  able  to  answer  it  readily  after 
participating  in  an  holistic  scoring? 


6.   Do  you  have  any  other  concerns  about  the  questionnaire? 




SAMPLE  OF  A  DATA  BASE  LOG  ENTRY 


THE  RECORD  NUMBER  IS 

READER  ID 

PACKET  NO 

MON  OR  UN 

PAPER  ID 

SCORE 

TOPIC 

QUAL  IDEA 

COMM  QUAL 

FOCUS 

COMM  FOCUS 

DEVELOPMT 

COMM  DEVEL 

ORG  STRUCT 

COMM  ORG 

STYLE  TONE 

COMM  STYLE 

APPROACH 

COMM  APPR 

DICTION 

COMM  DICTION 

SENT  STRUC 

COMM  STRUC 

MECH  PROB 

COMM  MECH 

SENT  ERROR 

COMM  SENT 
USAGE 

COMM  USAGE 

DIAL  ESL 

COMM  DIAL 

PUNCT  CAPS 

COMM  PUNCT 

SPELLING 

COMM  SPELL 

LENGTH 

COMM  LGTH 

HANDWRITING 

COMM  HAND 

WRITR  ROLE 

COMM  WRITR 

OVERALL 

COMM  OVRLL 

SCR  B  SMP 

SCR  A  SMP 

COMM  SMP 

SCR  B  TBLL 

SCR  A  TBLL 

COMM  TBLL 

B  CHK  READ 

A  CHK  READ 


213 


2A 

V 

M 

005 

3 

CERTIFIED  NURSE'S  ASSISTANT 

+ 

VERY    CONCRETE    ESSAY 


MANY    GOOD    EXAMPLES 


AWK.     SENT.    STRUCT.    AT   TIMES 


SOME    ERRORS 


TAI 


LEADER READ AND AGREE!


[Scoring form not recoverable from the scan; legible column label: Paper Number]


Descriptions  of  the  Levels  of  the  CLAST  Ratings 

Score  of  4:  Writer  purposefully  and  effectively  develops  a  thesis. 
Writer  uses  relevant  details,  including  concrete 
examples,  that  clearly  support  generalizations. 
Paragraphs  carefully  follow  an  organizational  plan  and 
are  fully  developed  and  tightly  controlled.  A  wide 
variety  of  sentences  occur,  indicating  that  the  writer 
has  facility  in  the  use  of  language,  and  diction  is 
distinctive.  Appropriate  transitional  words  and  phrases 
or  other  techniques  make  the  essay  coherent.  Few  errors 
in  syntax,  mechanics,  and  usage  occur. 

Score  of  3:  Writer  develops  a  thesis  but  may  occasionally  lose  sight 
of  purpose.  Writer  uses  some  relevant  and  specific 
details  that  adequately  support  generalizations. 
Paragraphs  generally  follow  an  organizational  plan  and 
are  usually  unified  and  developed.  Sentences  are  often 
varied,  and  diction  is  usually  appropriate.  Some 
transitions  are  used,  and  parts  are  usually  related  to 
each  other  in  an  orderly  manner.  Syntactical, 
mechanical,  and  usage  errors  may  occur  but  usually  do 
not  affect  clarity. 

Score  of  2:  Writer  may  state  a  thesis,  but  the  essay  shows  little, 
if  any,  sense  of  purpose.  Writer  uses  a  limited  number 
of  details,  but  they  often  do  not  support 
generalizations.  Paragraphs  may  relate  to  the  thesis 
but  often  will  be  vague,  underdeveloped,  or  both. 
Sentences  lack  variety  and  are  often  illogical,  poorly 
constructed,  or  both.  Diction  is  pedestrian. 
Transitions  are  used  infrequently,  mechanically,  and 
erratically.  Numerous  errors  may  occur  in  syntax, 
mechanics,  and  usage  and  frequently  distract  from 
clarity. 

Score  of  1:  Writer's  thesis  and  organization  are  seldom  apparent, 
but,  if  present,  they  are  unclear,  weak,  or  both.  Writer 
uses  generalizations  for  support,  and  details,  when 
included,  are  usually  ineffective.  Underdeveloped, 
ineffective  paragraphs  do  not  support  the  thesis. 
Sentences  are  usually  illogical,  poorly  constructed, 
or  both.  They  usually  consist  of  a  series  of  subjects 
and  verbs  with  an  occasional  complement.  Diction  is 
simplistic  and  frequently  not  idiomatic.  Transitions 
and  coherence  devices,  when  discernible,  are  usually 
inappropriate.  Syntactical,  mechanical,  and  usage 
errors  abound  and  impede  communication. 


APPENDIX  B 
SCORING  CHARTS  AND  ADDITIONAL  FIGURES 


Associate  Chief  Reader's  Log 
January  7,  1989 

8:30  Comments  about  the  study  procedures 

8:34  Orientation  by  the  chief  reader 

8:35  Rangefinders  D,  I,  M,  T,  W,  LL 

8:50  Break 

9:07  Samples  FF,  O

9:15  Samples  JJ,  CC 

9:22  Sample  E 

9:24  "Live"  papers 

10:00  Begin  check  reading  #1 

10:04  Break 

10:16  Samples  N,  U 

10:23  "Live"  papers 

11:13  Break 

11:30  Samples  BB,  V 

11:27  "Live"  papers 

11:48  Begin  check  reading  #2 

12:21  Lunch 

1:30  Samples  X,  P 

1:36  "Live"  papers 

2:26  Break 

2:30  "Live"  papers 

3:40  End  of  reading 


Note:   Total  reading  time:   254  minutes  =  4  hours,  23  minutes 


Figure B-1. Account of procedures.




Category 


Reader  1A 
Score  2 


Reader 1B
Score  1 


Reader  1C 
Score  2 


Reader 1D
Score  2 


Quality  of 
Ideas 


Thesis  clear/3  parts 


Development 


Thin  on  development   Paragraphs  not  dev. 


Needs  more  dev.  with     Minimal  develop- 
specifics  ment 


Organization 


Organization  poor 


Style/Tone 


Approach 


Sentence 
Structure 


Mechanical 
Problems 


Sentence 
Errors 


Usage  Errors 


Dialect/ESL 
Errors 


Punctuation/ 
Capitalization 


Spelling 


Length 


Handwriting 


Writer's  Role 


Overall  Comment 


Several  wrong  words      Element, 
sentences 


Figure B-2. Summary of comments for paper 034




Reader  2A 
Score  2 


Reader  2B 
Score  2 


Reader  2C 
Score  2 


Reader  2D 
Score  1 


Quality  of 
Ideas 


Weak  content         Well  thought  out 


Content  terrible 


Development 


Develops  thesis         Not  very  developed      Does  not  support 

points 


Organization 


Organization  OK 


Style/Tone 


Approach 


Sentence 
Structure 


Awkward  syntax       Errors  in  syntax 


Mechanical 
Problems 


Numerous  errors 


Language  skills  not 
strong 


Sentence  Errors 


Usage  Errors 


Dialect/ESL 
Errors 


Punctuation/ 
Capitalization 


Spelling 


Errors  in  spelling 


Length 


Handwriting 


Writer's  Role 


Overall  Comment 


Figure  B-2. — (Continued) 


Organization 


Problems  in 
introduction 


Style/Tone 
Approach 


Diction  is  a 
problem 


Mechanical 
Problems 


Sentence  Errors 


Dialect/ESL 
Errors 




Reader  3A 
Score  2 


Reader  3B 
Score  2 


Reader  3C 
Score  2 


Reader  3D 
Score  2 


Quality  of 
Ideas 


Good,  supported 
thesis 


Not  specific  enough 


First  par.  all 
thesis 


Development 


No  sense  of 
detail 


Sentence 
Structure 


Poor  sentence 
logic 


Some  awkwardness 


Competent sent.
structure 


Usage  Errors 


Unclear  pronoun 
reference 


Vague  pronoun 
reference 


Unclear  pron. 
reference 


Punctuation/ 
Capitalization 


Spelling 


Misspellings 


Incorrect  spelling 
of  lured 


lured,  field,  though,    Misspellings 
knowledge 


Length 


Handwriting 


Writer's  Role 


Overall  Comment 


Good,  acceptable 
paper 


Paper  barely  makes 
sense  with  vague 
pronoun  reference 


Figure  B-2. — (Continued) 


Development 


Much  detail 


Good  supporting 
details 


Organization/ 
Structure 


OK  except  for  weak 
conclusion 


Style/Tone 


Good  paragraph 
unity 


Approach 


Conjecture 


Mechanical 
Problems 


Not  many  basic 
errors 


Sentence 
Errors 


Dialect/ESL 


Punctuation/ 
Capitalization 


Spelling 


Length 


Some  words  omitted 


Handwriting 


Difficult  to  read 
in  places 


Writer's  Role 




Reader  1A 
Score  3 


Reader 1B
Score  3 


Reader  1C 
Score  2 


Reader 1D
Score  3 


Quality  of 
Ideas 


Problems  in  logic 


A  weak  argument 


Sentence 
Structure 


Some  diction 
problems 


Figure B-3. Summary of comments for paper 088


Development 


Some  generalities 


Organization/ 

Structure 


Approach 


Sentence 
Errors 


Comma  splice 


Usage 


Many  errors  in 
usage 


Dialect/ESL 


Punctuation/ 
Capitalization 


Needs  to  review 
commas 


Spelling 


Length 


Handwriting 


Writer's  Role 


Difficult  to  follow 


Figure   B-3 . — (Continued) 




Reader  2A 
Score  2 


Reader  2B 
Score  2 


Reader  2C 
Score  3 


Reader  2D 
Score  3 


Quality  of 
Ideas 


Content  seemingly  OK 


Content  good 


Style/Tone 


Some  awkward 
phrasing 


Some  awkward 
phrasing 


Awkward  word 
choice 


Wordiness 
hinders  style 


Sentence 
Structure 


Mechanical 
Problems 


Awkward  sentence 
structure  hurts 
content 


Awkward  sentence 
structure 


Not  many  basic 
errors 


A  few  language 
errors 


Mechanical 
Problems 




Category 


Reader  3A 
Score  2 


Reader  3B 
Score  3 


Reader 
Score 


Reader  3D 
Score  2 


Quality  of 
Ideas 


Lack  of  certain 
clear  logic 


Thesis  doesn't 
follow  from 
introduction 


Development 


Numerous  but 
innocuous  details 


There  is  development 
but  some  developing 
is  not  well  done 


Generalizations 
too  long,  need 
specifics 


Organization/ 
Structure 


Terse  conclusion 


Weak  conclusion 


Style/Tone 


Sophisticated  but 
incorrectly  done 


Approach 


Lists aren't effective


"Effectivity"
horrible!


Diction  isn't 
clear,  isn't  apt 


Sentence 
Structure 


Horrible  use  of 
passive 


A  good  sentence 


Sentences  are  often 
awkward 


Sentence 
Errors 


Run-on  sentences 


Dialect/ESL 


Verbs/verb  tenses 


Punctuation/ 
Capitalization 


Needs  to  review 
commas 


Spelling 


Use  of  than  when 
writer  means  then 


Length 


Handwriting 


Difficult  to  read 
copy 


Xerox  copy  poor 


Difficult  to  read 
copy 


Writer's  Role 


Student  needs 
help  with 
conditionals 


Writer  knows  what  he 
wants  to  say  and  is 
saying  it  relatively 
well 


Paper  is 
shallow 


Evidence  of 
thinking,  of 
revision 


Figure  B-3 . — (Continued) 




                Scores of        Scores of        Scores of
Rangefinders    Table Leader 1   Table Leader 2   Table Leader 3

D               2                2                2
I               4                4                4
M               2                2                2
T               1                1                1
W               3                3                3
LL              1                1                2

Samples

A               2                2                1
B               2                2                1
C               3                3                2
E               2                3                3
F               1                2                1
G               1                1                1
H               1                1                1
J               1                1                1
K               3                2                3
L               3                3                3
N               2                2                2
O               1                1                1
P               4                3                4
Q               3                3                3
R               2                2                2
S               2                2                2
U               2                3                3
V               4                4                4
X               2                3                4  **spl
Y               4                4                4
Z               1                1                1
AA              3                4                3
BB              1                2                1
CC              4                3                4
DD              2                2                2
EE              3                3                3
FF              3                2                2
GG              1                1                1
HH              2                2                2
II              2                2                2
JJ              2                2                3
KK              2                2                2
MM              1                2                1

Figure B-4. Results of table leaders' independent scoring
of training samples.
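
As a rough illustration of how the scores in Figure B-4 might be tallied, the sketch below counts the papers on which all three table leaders gave the same score, those on which they differed by one point, and those on which they differed by two or more. The tally is not part of the original study. The names SCORES, spread, and summarize are hypothetical, and the scanned labels "0" and "u" are assumed to be papers O and U.

```python
# Scores transcribed from Figure B-4: paper -> (Table Leader 1, 2, 3).
# Assumption: the scanned labels "0" and "u" are read as papers O and U.
SCORES = {
    # Rangefinders
    "D": (2, 2, 2), "I": (4, 4, 4), "M": (2, 2, 2),
    "T": (1, 1, 1), "W": (3, 3, 3), "LL": (1, 1, 2),
    # Samples
    "A": (2, 2, 1), "B": (2, 2, 1), "C": (3, 3, 2), "E": (2, 3, 3),
    "F": (1, 2, 1), "G": (1, 1, 1), "H": (1, 1, 1), "J": (1, 1, 1),
    "K": (3, 2, 3), "L": (3, 3, 3), "N": (2, 2, 2), "O": (1, 1, 1),
    "P": (4, 3, 4), "Q": (3, 3, 3), "R": (2, 2, 2), "S": (2, 2, 2),
    "U": (2, 3, 3), "V": (4, 4, 4), "X": (2, 3, 4), "Y": (4, 4, 4),
    "Z": (1, 1, 1), "AA": (3, 4, 3), "BB": (1, 2, 1), "CC": (4, 3, 4),
    "DD": (2, 2, 2), "EE": (3, 3, 3), "FF": (3, 2, 2), "GG": (1, 1, 1),
    "HH": (2, 2, 2), "II": (2, 2, 2), "JJ": (2, 2, 3), "KK": (2, 2, 2),
    "MM": (1, 2, 1),
}


def spread(scores):
    """Return the gap between the highest and lowest score for one paper."""
    return max(scores) - min(scores)


def summarize(table):
    """Tally unanimous, adjacent (1-point), and discrepant (2+ point) papers."""
    return {
        "total": len(table),
        "unanimous": sum(1 for s in table.values() if spread(s) == 0),
        "adjacent": sum(1 for s in table.values() if spread(s) == 1),
        "discrepant": sorted(p for p, s in table.items() if spread(s) >= 2),
    }


if __name__ == "__main__":
    print(summarize(SCORES))
```

The two-point threshold for "discrepant" is an illustrative choice here, not a definition taken from the study.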


REFERENCES 


Allen, C. L. (1976). A study of the effect of selected
mechanical errors on teachers' evaluation of the nonmechanical
aspects of students' writing. Dissertation
Abstracts International, 37, 09A (p. 5554-A).
(University Microfilm 76-15, 827)

Bamberg, B. (1982). Multiple-choice and holistic essay
scores: What are they measuring? College Composition
and Communication, 33, 404-406.

Barritt, L., Stock, P. L., & Clark, F. (1986).
Researching  practice:  Evaluating  assessment  essays. 
College  Composition  and  Communication,  37,  315-327. 

Bauer,  B.  A.  (1981).  A  study  of  the  reliabilities  and  the 
cost-efficiencies  of  three  methods  of  assessment  for 
writing  ability.  Research  at  the  University  of 
Illinois.  (ERIC  Document  reproduction  service  ED  216 
357) 

Berdie, D. R., & Anderson, J. F. (1974). Questionnaires:
Design and use. Metuchen, NJ: The Scarecrow Press,
Inc.

Bleich, D. (1975). Readings and feelings: An introduction
to subjective criticism. Urbana, IL: National
Council of Teachers of English.

Braddock, R., Lloyd-Jones, R., & Schoer, L. (1963).
Research  in  written  composition.  Urbana,  IL: 
National  Council  of  Teachers  of  English. 

Breland, H. M., Camp, R., Jones, R. J., Morris, M. M., &
Rock,  D.  A.  (1987).  Assessing  writing  skill. 
(Research  Monograph  11).  New  York:  College  Entrance 
Examination  Board. 

Breland, H. M., & Jones, R. J. (1984). Perceptions of
writing  skills.   Written  Communication,  1,  101-119. 

CCCC  Task  Force  on  the  Preparation  of  Teachers  of  Writing 
(1982).  Position  statement  on  the  preparation  and 
professional  development  of  teachers  of  writing. 
College  Composition  and  Communication,  33,  446-449. 




Charney,  D.  (1984).  The  validity  of  using  holistic 
scoring  to  evaluate  writing:  A  critical  overview. 
Research  in  the  Teaching  of  English,  18,  65-81. 

Chase, C. I. (1968). The impact of some obvious variables
on essay test scores. Journal of Educational
Measurement, 5, 315-318.

Collins,  J.  L.,  &  Williamson,  M.  M.  (1984).  Assigned 
rhetorical  context  and  semantic  abbreviation  in 
writing.  In  R.  Beach,  &  L.  S.  Bridwell  (Eds.)  New 
directions  in  composition  research,  pp.  285-296.  New 
York:   The  Guilford  Press. 

Cooper, C. R., & Odell, L. (1977). Evaluating writing,
describing,  measuring,  judging.  Urbana,  IL:  National 
Council  of  Teachers  of  English. 

Davis, B., Scriven, M., & Thomas, S. (1987). The
evaluation  of  composition  instruction  (2nd  ed.).  New 
York:   Teachers  College  Press. 

Diederich,  P.  B.  (1974).  Measuring  growth  in  English. 
Urbana,  IL:   National  Council  of  Teachers  of  English. 

Diederich,  P.,  French,  J.,  &  Carlton,  S.  (1961).  Factors 
in  judgments  of  writing  ability.  Princeton,  NJ: 
Educational  Testing  Service  ED  002172. 

Elbow,  P.  (1986).  Embracing  contraries.  New  York: 
Oxford  University  Press. 

Faigley,  L.,  Cherry,  R.  D.,  Jolliffe,  D.  A.,  &  Skinner, 
A.  M.  (1985).  Assessing  writers'  knowledge  and 
processes  of  composing.   Norwood,  NJ:   Ablex. 

Fish, S. (1980). Is there a text in this class? The
authority of interpretive communities. Cambridge, MA:
Harvard Univ. Press.

Follman, J. C., & Anderson, J. A. (1967). An
investigation  of  the  reliability  of  five  procedures 
for  grading  English  themes.  Research  in  the  Teaching 
of  English,  1,  190-200. 

Freedman,  S.  W.  (1979).  How  characteristics  of  student 
essays  influence  teachers'  expectations.  Journal  of 
Educational  Psychology,  71,  (3),  328-338. 




Freedman,  S.  W.  (1981).  Influences  on  evaluators  of 
expository  essays:  Beyond  the  text.  Research  in  the 
Teaching  of  English,  15,  245-255. 

Freedman,  S.  W.  (1984).  The  registers  of  student  and 
professional  expository  writing:  Influences  on 
teachers'  responses.  In  R.  Beach,  &  L.  S.  Bridwell 
(Eds.)  New  directions  in  composition  research  (pp. 
334-347).   New  York:   The  Guilford  Press. 

Freedman, S. W., & Calfee, R. C. (1983). Holistic
assessment  of  writing:  experimental  design  and 
cognitive  theory.  In  P.  Mosenthal,  L.  Tamor,  &  S.  A. 
Walmsley  (Eds.),  Research  on  writing:  Principles  and 
methods  (pp.  75-98).   New  York:   Longman. 

Godshalk, F., Swineford, F., & Coffman, W. E. (1966). The
measurement of writing ability. (Research Monograph
No. 6). New York: College Entrance Examination
Board.

Grobe,  C.  H.  (1981).  Syntactic  maturity,  mechanics,  and 
vocabulary  as  predictors  of  quality  ratings.  Research 
in  the  Teaching  of  English,  15,  75-85. 

Hake, R. L., & Williams, J. M. (1981). Style and its
consequences:  Do  as  I  do,  not  as  I  say.  College 
English,  43,  433-451. 

Harris,  W.  H.  (1977).  Teacher  response  to  student 
writing:  A  study  of  the  response  patterns  of  high 
school  English  teachers  to  determine  the  basis  for 
teacher  judgment  of  student  writing.  Research  in  the 
Teaching  of  English,  11,  175-185. 

Haswell,  R.  H.  (1988).  Dark  shadows:  The  fate  of  writers 
at  the  bottom.  College  Composition  and  Communication, 
39,  303-315. 

Hirsch,  E.  D.,  Jr.  (1977).  The  philosophy  of  composition. 
Chicago:   The  Univ.  of  Chicago  Press. 

Hoetker,  J.,  &  Brossell,  G.  (1986).  A  procedure  for 
writing  content-fair  essay  examination  topics  for 
large-scale  writing  assessments.  College  Composition 
and  Communication,  33,  377-392. 

Hrach,  E.  (1983).  The  influence  of  rater  characteristics 
on  composition  evaluation  practices.  Dissertation 
Abstracts International, 45, 02A (p. 440).
(University  Microfilms  No.  DEQ  84-10956) 




Huot, B. A. (1988). The validity of holistic scoring: A
comparison of the talk-aloud protocols of expert and
novice holistic raters. Dissertation Abstracts
International, 49, 08A (p. 2188). (University
Microfilms No. DA 8817872)

Janopoulos,  M.  (1987).  The  role  of  comprehension  in 
holistic  evaluation  of  second-language  writing 
proficiency  at  the  university  level.  Dissertation 
Abstracts  International,  48,  05A  (p.  1137). 
(University  Microfilms  No.  DET  87-17654) 

Keech,  C.  (1982).  Practice  in  designing  writing  test 
prompts:  Analysis  and  recommendations.  In  J.  R. 
Gray,  &  L.  P.  Ruth  (Eds.)  Properties  of  writing 
tasks:  A  study  of  alternative  procedures  for  holistic 
writing  assessment  (pp.  132-214).  University  of 
California,  Graduate  School  of  Education,  Bay  Area 
Writing Project. (ERIC Document Reproduction No. ED
230 576).

Kroll,  B.  M.  &  Schafer,  J.  C.  (1978).  Error-analysis  and 
the  teaching  of  composition.  College  Composition  and 
Communication,  29,  242-248. 

Marshall,  J.  C.  (1972).  Writing  neatness,  composition 
errors,  and  essay  grades  reexamined.  The  Journal  of 
Educational  Research,  65,  213-215. 

Martin,  W.  (1987).  A  study  of  reader  process  in  the 
evaluation  of  English  placement  essays.  Dissertation 
Abstracts  International,  48,  08A.  (University 
Microfilms  No.  87-24759) 

McColly,  W.  (1970).  What  does  educational  research  say 
about  the  judging  of  writing  ability?  The  Journal  of 
Educational  Research,  64,  148-156. 

Mishler, C., & Hogan, T. P. (1982). Holistic scoring of
essays: Remedy for evaluating the third R.
Diagnostique, 8, 4-16.

Murphy, S., Carroll, K., Kinzer, C., & Robyns, A. (1982).
A  study  of  the  construction  of  the  meanings  of  a 
writing  prompt  by  its  authors,  the  student  writers, 
and  the  raters.  In  J.  R.  Gray,  &  L.  P.  Ruth  (Eds.) 
Properties  of  writing  tasks:  A  study  of  alternative 
procedures  for  holistic  writing  assessment  (pp. 
336-471).  University  of  California,  Graduate  School 
of  Education,  Bay  Area  Writing  Project.  (ERIC 
Document  Reproduction  No.  ED  230  576). 




Myers,  M.  (1980).  A  procedure  for  writing  assessment  and 
holistic  scoring.  Urbana,  IL:  ERIC  Clearinghouse  on 
Reading  and  Communication  Skills  and  the  National 
Council  of  Teachers  of  English. 

Neilsen,  L.,  &  Piche,  G.  L.  (1981).  The  influence  of 
headed  nominal  complexity  and  lexical  choice  on 
teachers'  evaluation  of  writing.  Research  in  the 
Teaching  of  English,  15,  65-73. 

Nold,  E.  W.,  &  Freedman,  S.  W.  (1977).  An  analysis  of 
readers'  responses  to  essays.  Research  in  the 
Teaching  of  English,  13,  164-174. 

Paden,  P.  (1986).  The  potential  dual  effect  of  context 
effects  and  score  level  effects  on  the  assignment  of 
scores  to  essays.  (Report  no.  143).  Princeton,  NJ: 
Educational  Testing  Service.  (ERIC  Document 
Reproduction  Service  No.  ED  280  852) 

Paulis,  C.  (1985).  Holistic  scoring:  A  revision 
strategy.   Clearing  House,  59,  57-60. 

Rafoth,  B.  A.,  &  Rubin,  D.  L.  (1984).  The  impact  of 
content  and  mechanics  on  judgments  of  writing  quality. 
Written  Communication,  1,    446-458. 

Raymond, J. C. (1982). What we don't know about the
evaluation of writing. College Composition and
Communication, 33, 399-403.

Roberts,  D.  H.  (1982).  Individualized  writing  instruction 
in  southern  West  Virginia  colleges:  A  study  of  the 
acquisition  of  writing  fluency.  Dissertation 
Abstracts International, 43, 05A (p. 1525).
(University  Microfilms  No.  DD  J82-24102) 

Roberts,  D.  H.  (1983).  Experimental  research  in  written 
composition:  A  critical  view.  (ERIC  Document 
Reproduction  No.  ED  238  006) 

Rosenblatt,  L.  M.  (1985).  The  transactional  theory  of  the 
literary work: Implications for research. In C.
Cooper (Ed.), Researching response to literature and
the teaching of literature. Norwood, NJ: Ablex.

Rosenblatt,  L.  M.  (1988).  Writing  and  reading:  The 
transactional  theory  (Tech.  Rep.  No.  13).  Berkeley: 
University  of  California  and  Carnegie  Mellon 
University,  Center  for  the  Study  of  Writing. 

Shaughnessy,  M.  (1980).  Statement  on  criteria  for  writing 
proficiency.   Journal  of  Basic  Writing,  3,  115-119. 

Shoaf,  J.  S.  (1985).  Measuring  gain  in  writing 
proficiency  qualitatively  and  quantitatively. 
Dissertation  Abstracts  International,  46,  11A 
(p.  3275).   (University  Microfilms  No.  DES  85-27322) 

Spandel,  V.,  &  Stiggins,  R.  (1981).  Direct  measures  of 
writing  skill:  Issues  and  applications  (Revised  ed.). 
Portland,  OR:  Northwest  Regional  Library.  (ERIC 
Document  Reproduction  Service  No.  ED  213  035) 

Stach,  C.  L.  (1987).  The  component  parts  of  general 
impressions:  Predicting  holistic  scores  in  college- 
level  essays.  Dissertation  Abstracts  International, 
48,  07A.   (University  Microfilms  No.  DES  87-22706) 

Stewart,  M.  F.,  &  Grobe,  C.  H.  (1979).  Syntactic 
maturity,  mechanics  of  writing,  and  teachers'  quality 
ratings.  Research  in  the  Teaching  of  English,  13, 
207-215. 

Sullivan,  F.  J.  (1986).  Placing  texts,  placing  writers: 
Sources  of  readers'  judgments  in  university  placement- 
testing.  (Technical  Research  Report  143).  (ERIC 
Document  Reproduction  Service  No.  ED  285  177) 

Sweedler-Brown,  C.  O.  (1985).  The  influence  of  training 
and  experience  on  holistic  essay  evaluations.  English 
Journal,  74(5),  49-55. 

Westcott,  W.,  &  Gardner,  P.  (1984).  Holistic  scoring  as 
a  teaching  device.  Teaching  English  in  the  Two-Year 
College,  11(2),  35-39. 

White,  E.  M.  (1985).  Teaching  and  assessing  writing.  San 
Francisco:   Jossey-Bass. 

Winer,  B.  J.  (1971).  Statistical  principles  in 
experimental  design.   New  York:   McGraw-Hill,  pp.  375-378. 

Winters,  L.  (1978).  The  effects  of  differing  response 
criteria  on  the  assessment  of  writing  competence.  Los 
Angeles,  CA:  Center  for  the  Study  of  Evaluation. 
(ERIC  Document  Reproduction  Service  No.  ED  212  659) 


BIOGRAPHICAL  SKETCH 

Willa  Buckley  Wolcott  was  born  and  raised  in  the  White 
Mountains  in  New  Hampshire.  She  graduated  as  a  Wellesley 
College  Scholar  from  Wellesley  College,  Wellesley, 
Massachusetts,  in  1964  with  a  B.A.  in  English,  and  she 
received  her  M.A.  degree  in  English  from  the  University  of 
Denver  in  1969.  She  taught  at  the  secondary  level  in 
Massachusetts  and  Colorado  and  worked  with  adult  education 
programs  in  Denver  and  Connecticut.  She  has  coordinated 
the  Writing  Center  in  the  Office  of  Instructional  Resources 
at  the  University  of  Florida  since  founding  the  Center, 
together  with  a  colleague,  in  1977.  She  serves  as  one  of 
the  chief  readers  for  the  statewide  holistic  scorings  of 
essays  written  for  the  College  Level  Academic  Skills  Test 
and  the  Florida  Teacher  Certification  Examination.  She  has 
had  articles  published  in  the  Writing  Lab  Newsletter,  The 
Writing  Center  Journal,  the  Journal  of  Basic  Writing,  and 
College  Composition  and  Communication. 

She  is  a  member  of  the  National  Council  of  Teachers  of 
English  and  of  the  Pi  Lambda  Theta  and  the  Kappa  Delta  Pi 
honorary  societies.  She  is  married  and  has  two  children 
away  at  college. 

I  certify  that  I  have  read  this  study  and  that  in  my 
opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Ruthellen  Crews,  Chair 
Professor  of  Instruction 
and  Curriculum 


I  certify  that  I  have  read  this  study  and  that  in 
my  opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Margaret  Early 
Professor  of  Instruction  and 
Curriculum 


I  certify  that  I  have  read  this  study  and  that  in 
my  opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Wright 
Associate  Professor  of  Instruction 
and  Curriculum 


I  certify  that  I  have  read  this  study  and  that  in 
my  opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Forrest  Parkay 
Professor  of  Educational 
Leadership 


I  certify  that  I  have  read  this  study  and  that  in 
my  opinion  it  conforms  to  acceptable  standards  of  scholarly 
presentation  and  is  fully  adequate,  in  scope  and  quality, 
as  a  dissertation  for  the  degree  of  Doctor  of  Philosophy. 


Sandra  Damico 
Professor  of  Foundations  of 
Education 


This  dissertation  was  submitted  to  the  Graduate  Faculty 
of  the  College  of  Education  and  to  the  Graduate  School  and 
was  accepted  as  partial  fulfillment  of  the  requirements  for 
the  degree  of  Doctor  of  Philosophy. 

December  1989 

Dean,  College  of  Education 


Dean,  Graduate  School 

